https://doi.org/10.31449/inf.v46i2.3603 Informatica 46 (2022) 291–298

Deep Reinforcement Learning-based Anomaly Detection for Video Surveillance

Sabrina Aberkane and Mohamed Elarbi-Boudihir
E-mail: s.aberkane@esi-sba.dz, m.elarbiboudihir@esi-sba.dz
ESI-SBA, Higher School of Computer Science, Sidi Bel-Abbès, Algeria

Student paper

Keywords: deep reinforcement learning, anomaly detection, video surveillance

Received: June 22, 2021

Anomaly detection in automated video surveillance is considered one of the most critical tasks to be solved: the aim is to detect a variety of real-world abnormalities. This paper introduces a novel approach to anomaly detection based on deep reinforcement learning. In recent years, deep reinforcement learning has achieved significant success in applications with highly complex data, such as robotics and games, by mimicking the way humans learn from experience. State-of-the-art methods generally classify a video as normal or abnormal without pinpointing the exact location of the anomaly in the input video, because training videos are not labeled at clip level. We focus on adapting prioritized Dueling deep Q-networks to the anomaly detection problem. The model learns to evaluate the anomaly in video clips by exploiting the video-level label to obtain better detection accuracy. Extensive experiments on 13 classes of real-world anomalies show that our DRL agent achieves near-optimal performance with high accuracy in a real-world video surveillance setting compared to state-of-the-art approaches.

Povzetek (Slovenian abstract): A new deep reinforcement learning method for recognizing anomalies in video surveillance is developed.

1 Introduction

In video surveillance systems, the ability to recognize actions can be used to detect and prevent abnormal or suspicious events. Such intelligent systems would be greatly helpful for providing security to people. Indeed, surveillance cameras also make some people feel safer, knowing that culprits are being watched. Generally, these kinds of systems are powered by different algorithms [1, 2, 3] for action recognition, object tracking and object classification. The design of such algorithms is typically addressed in computer vision research, which studies how machines can gain a human-like understanding of digital images and videos. In this work, we focus on designing an intelligent visual surveillance system that aims to detect abnormalities in urban places.

The anomaly detection task has been studied for decades and remains a very active topic due to its broad real-world applications, including visual surveillance. To address the abnormality detection problem, some researchers attempted to give a general definition covering all normal and abnormal motions in daily life, while many others treated the task as an activity classification problem. All these works share one main purpose: to build an intelligent machine that imitates the human capability of interpreting complex human behaviors in a cluttered environment. Is it possible that a machine could perform the recognition task at the same level as humans?

In this paper, we try to answer the question above by demonstrating that a machine can be as efficient as a human as long as it succeeds in reproducing the human's native learning mechanism.
Indeed, we consider building an agent able to learn from the environment through a sequence of trials and errors. The video analytics framework takes a video clip as input; the pre-trained agent then provides two principal elements separately: the first is an estimation of the existence of abnormal content in the video, and the second indicates the anomaly score of each segment in the video. The system architecture is inspired by a trending approach called deep reinforcement learning, a branch of machine learning based on the concept that an agent learns by interacting with an environment. The agent was trained on a new large-scale dataset of 1900 videos, 128 hours of untrimmed real-world surveillance footage, covering 13 classes of realistic abnormalities.

The organization of this paper is as follows. After the introduction in this section, we present state-of-the-art anomaly detection approaches in Section 2. Subsequently, we describe how the system is implemented using the Dueling DQN, as well as the anomaly localization, in Section 3. Section 4 presents the results, and conclusions are finally drawn in Section 5.

| System | Techniques | Scene | Localization | Dataset(s) | Accuracy |
|---|---|---|---|---|---|
| Schuldt et al. [4] | Pattern representation, SVM classification | Outdoor, indoor, uncrowded | Disabled | Action database (available on request) | 71.70% |
| Hu et al. [6] | Trajectory modeling, cluster-based | Outdoor, crowded traffic | Disabled | Action database (available on request) | 80% |
| Qiao et al. [7] | Optical flow modeling, deep autoencoder | Outdoor, indoor | Disabled | Lawn, indoor, plaza | 98.33% |
| Khaleghi et al. [8] | Deep learning | Outdoor, indoor | Enabled | UCSD dataset | 88.1% |
| Shean Chong et al. [9] | Spatiotemporal architecture, convolutional autoencoder | Outdoor, crowded | Enabled | UCSD dataset | 89.9% |
| Hasan et al. [10] | Learned pattern model, autoencoder | Outdoor, indoor, crowded | Enabled | CUHK Avenue; UCSD Ped1; UCSD Ped2; Subway Entrance; Subway Exit | 70.2%; 81.0%; 90.0%; 94.3%; 80.7% |
| Sultani et al. [11] | Multiple instance learning, deep learning | Outdoor, crowded | Enabled | UCF dataset | 75.41% |
| Oh et al. [12] | Reinforcement learning | - | Disabled | GeoLife GPS; TST | 35%; 93% |

Table 1: Comparison of properties between state-of-the-art approaches.

2 Related work

The initial studies on anomaly detection were reported in [4, 5, 6, 7], where the systems model the normal motion of individuals as trajectories and an anomaly is detected as a deviation from the normal trajectory. More recent works use deep learning, which achieves competitive performance on video data. In [8], a deep learning-based technique is used in both the feature extraction phase and the rare-event detection phase. The authors in [9] employ a spatiotemporal autoencoder to design a framework for event detection, composed of both a spatial feature representation and the learning of the temporal evolution of the spatial features. Hasan et al. [10] also used deep learning with autoencoders, presenting a fully connected autoencoder that learns a model for anomaly detection. To learn anomalous events, Sultani et al. [11] constructed a new framework based on deep multiple instance learning, which leverages weakly labeled training videos. The authors in [12] applied inverse reinforcement learning (IRL) to sequential anomaly detection: the system takes the sequence of actions of a target agent as input and evaluates whether the observed behavior follows a normal pattern or not.
The proposed approach works with a reward function inferred via IRL.

Table 1 compares the properties of the previous systems. 'Scene' indicates where the anomaly occurs and the number of individuals on-site (crowded, uncrowded); 'Localization' specifies whether the system can locate where the anomaly is occurring. The 'Scene' column of [12] is marked as (-) because the dataset used is a sequence of time-stamped points, each containing latitude, longitude and altitude information.

3 System modeling

In this paper, we formulate anomaly detection as a sequential decision-making process, and we propose a deep anomaly detection network that estimates the probability that each video segment covers an abnormality. We assume that for a given video only a small number of segments contain the anomaly. Hence, we employ the reinforcement learning approach to train our detection network, which encourages high scores for anomalous video segments compared to normal segments. This is equivalent to finding the N segments with the highest abnormality scores in an input video.

Figure 2: A sample of the distribution of anomaly segments in an abnormal video (red).

Deep reinforcement learning offers two different structures through which a machine can teach itself based on the results of its actions. One is the Deep Q-network (DQN) structure, which relies only on the evaluation of actions to make decisions; the other is the Dueling Deep Q-network (DDQN), which, unlike the first, takes advantage of both the action value and the environment information.

Figure 1: The system selection process.

We adopted the dueling structure introduced by Wang et al. [13], which explicitly separates the representation of the state value and the state-dependent action advantages via two separate streams:

Q(s, a) = V(s) + A(s, a)    (1)

A(s, a) denotes the advantage stream, which outputs a vector whose dimensionality equals the number of actions, representing the value of selecting an action a_i at state s_t. V(s) denotes the value stream, which outputs a scalar representing the value of state s_t; the value of a state is independent of the actions. Both streams are combined at the end to produce the Q-function estimate through a combining module that simply aggregates the value and advantage streams as in [14]. The final output is a set of Q values Q(s, a), one for each action.
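The paper specifies the dueling decomposition only at the level of Equation 1 (and Equation 6 below); the exact layer composition of the two streams is not given at this point. The following Keras sketch is therefore only meant to make the aggregation concrete, under assumed ingredients: one C3D feature vector per segment as input, an assumed feature size of 4096, hypothetical hidden-layer widths, and average pooling as the "combined features" fed to the value stream.

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

N_SEGMENTS = 32   # segments in the initial state S_0 (the paper shrinks N during an episode)
FEAT_DIM = 4096   # C3D FC6 feature size (assumed)

# Input: one C3D feature vector per remaining segment.
seg_feats = layers.Input(shape=(N_SEGMENTS, FEAT_DIM), name="segment_features")

# Value stream V(s): combined (here: average-pooled) segment features -> one scalar.
pooled = layers.GlobalAveragePooling1D()(seg_feats)
v_hidden = layers.Dense(512, activation="relu")(pooled)          # hypothetical width
state_value = layers.Dense(1, name="V")(v_hidden)                # shape (batch, 1)

# Advantage stream A(s, a): one score per segment/action, computed per segment.
a_hidden = layers.TimeDistributed(layers.Dense(512, activation="relu"))(seg_feats)
advantages = layers.TimeDistributed(layers.Dense(1))(a_hidden)   # (batch, N, 1)
advantages = layers.Reshape((N_SEGMENTS,), name="A")(advantages) # (batch, N)

# Combining module: Q(s,a) = V(s) + A(s,a) - mean_a A(s,a), as in Eq. 6.
q_values = layers.Lambda(
    lambda t: t[0] + t[1] - tf.reduce_mean(t[1], axis=1, keepdims=True),
    name="Q")([state_value, advantages])

dueling_head = Model(seg_feats, q_values)
```

The Lambda layer plays the role of the combining module: subtracting the mean advantage keeps V and A identifiable, following the aggregation used in [13, 14].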
As is well known, the Markov decision process (MDP) is the underlying basis of any reinforcement learning model. Thus, to train the reinforcement learning agent to detect abnormal events or behaviors in a video, it is crucial to structure the environment of the processed video so that it satisfies the Markov property, defined by a tuple (S, A, P, R, γ) of states S, actions A, transition probabilities P, rewards R and a discount factor γ.

State: we consider that an input video V contains N segments x_i, where V = ∑_{i=1}^{N} x_i. The state of the environment is a set of video segments S = [x_1^V, x_2^V, x_3^V, ..., x_N^V], where the initial state S_0 is composed of N = 32 segments; for the subsequent states, the number of segments decreases until it reaches the minimum allowed, N = 5, which defines the terminal state.

Action: an action a_i ∈ A is any action executed by the RL agent to achieve its final objective, which is finding the set of segments x_i covering the anomaly in order to trigger an alarm.

Reward: in the literature we find reward/penalization functions that are quite simple, as used in games [15], and others much less simple, like the one defined in [16]. Generally, in more complex environments, the reward function must be designed to suit the agent's environment and to reflect what the agent has actually learned from the previous episodes. Our environment can be described as real, complex and unpredictable. In this context we base the rewarding/penalizing scheme on two axes:

– Actions A related to segments X from normal videos V_n: the reward value r(V_n, a_i) reaches its maximum when Q(s_t, a_i) = 0. The agent is penalized when Q(s_t, a_i) = 1; in this case the action is judged according to the database annotations only.

– Actions A related to segments X from abnormal videos V_ab: the reward value r(V_ab, a_i) reaches its maximum when Q(s_t, a_i) matches the expert evaluation E_c, i.e. Q(s_t, a_i) = E_c(s_t, a_i), and its minimum when |Q(s_t, a_i) − E_c(s_t, a_i)| = 1.

3.1 Selection process

Two different schemes can be used to isolate the segments that accurately represent the anomaly. One is to directly score each segment and then take the most representative one, judged by the highest anomaly value in each video. The other is to gradually remove the worst segment, judged by the lowest anomaly value, so that the remaining segments are the most representative ones. Still, due to the shortage of annotated videos for the learning process, it is preferable to maximize the number of iterations within one video, in order to improve the learning results even with fewer resources. Additionally, it is not easy to select the right segments in the first run, and finding the worst segment in a video is less complex and more rewarding than directly finding the segments of interest. Thus, we adopt the second scheme, where the agent performs an action a_i by removing a segment x_i at state s_t. The action space is therefore limited and differs at each step t.

The state s_t is represented by the segments remaining after t moves, and the action a_i is represented by excluding a segment x_i at move t. Excluding a segment may lead to two outcomes: the next state s_{t+1} or termination, where termination means that we have already found a set of segments that contain an anomaly. The evaluation feedback r_i from the environment for (s_t, a_i) is provided not only by the video-level annotations, but also by segment-level annotations provided by a nominated expert E_c in anomaly recognition. The expert teaches our network, as its recognition performance indicates the quality of the input segments x_i. To force the RL agent to learn the environment dynamics by itself, the environment does not provide any other feedback to the agent apart from the state and the reward.
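To make the removal-based selection concrete, here is a minimal sketch of the episode loop described above. It assumes a callable `q_values` (for instance, the dueling head sketched in Section 3) that returns one anomaly value per remaining segment; the function name, the argument layout and the handling of normal videos are illustrative assumptions, not the authors' exact implementation.

```python
import numpy as np

MIN_SEGMENTS = 5   # terminal-state size for abnormal videos (N = 5)

def select_anomalous_segments(segment_features, q_values, is_normal_video):
    """Iteratively remove the segment with the lowest Q value (cf. Eq. 7).

    segment_features: numpy array of shape (N, feat_dim), one C3D feature per segment.
    q_values: callable mapping the current feature set to one Q value per segment.
    Returns the indices of the segments still present at the terminal state
    (empty for a correctly handled normal video).
    """
    remaining = list(range(len(segment_features)))
    # Normal videos: the agent may remove every segment; abnormal videos stop at 5.
    min_left = 0 if is_normal_video else MIN_SEGMENTS
    while len(remaining) > min_left:
        q = q_values(segment_features[remaining])   # one value per remaining segment
        worst = int(np.argmin(q))                   # action a_i = argmin_a Q(s_t, a)
        del remaining[worst]                        # exclude segment x_i -> state s_{t+1}
    return remaining                                # segments judged most anomalous
```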
3.2 Deep dueling-based anomaly detection

The system is mainly based on the dueling structure, which relies on two different sub-networks sharing the same feature extraction layer [17]. The input is a video, which is fragmented into 32 segments x_i and considered as the initial state s_0 at time t_0:

V = S_0 = { x_i | 1 ≤ i ≤ 32 }    (2)

Figure 3: Deep dueling system for training a video surveillance agent.

To transform the raw video segment data into a format understandable by the artificial agent, we extract the visual features of each video segment x_i using the C3D feature extractor [17]; the corresponding final state format is:

V = S_0 = { f_i^{x_i} | 1 ≤ i ≤ 32 }    (3)

Technically, each state s_t is a set of visual feature representations f_i^{x_i}, which encapsulate the 3D convolutional features of the video segments x_i.

To estimate the anomaly value of a state s_t (i.e., of the video), all the extracted segment features f_i^{x_i} are combined and used as input to the value stream, which produces a single output value V(s_t), given by Equation 4. This stream's output indicates the probability that the video contains an anomaly.

V(s_t, θ, β) = V({ f_i^{x_i} | 1 ≤ i ≤ N }, θ, β)    (4)

Figure 4: The value stream.

The action-dependent advantage function A(s, a) computes the advantage of selecting a particular action (segment) over the base value of being in the current state (video). The advantage stream is estimated by using the C3D features of the N remaining segments separately as input, where N = 32 at step t_0 / state s_0:

A(s_t, a_t, θ, α) = A({ f_i^{x_i} | 1 ≤ i ≤ N }, f_i^{x_i}, θ, α)    (5)

Figure 5: The advantage stream.

After the state value V(s_t) and the video segment values A(s_t, a_i) are calculated, the outputs of the two streams are combined by an aggregation layer to evaluate each video segment x_i, according to the following equation:

Q(s_t, a_i, θ, α, β) = V(s_t, θ, β) + A(s_t, a_t, θ, α) − (1/|A|) ∑_{a_{t+1}} A(s_t, a_{t+1}, θ, α)    (6)

Q(s_t, a_i) corresponds to the conditional probability of executing action a_i, which represents the deletion of a segment x_i from the state at step t; the deletion criterion is the minimum Q value among all values:

a_i | s_t =
    not argmin_a Q(s_t, a)    if x_i is a part of s_{t+1}
    argmin_a Q(s_t, a)        if x_i is not a part of s_{t+1}    (7)

The features f_i of the segments selected at state s_t are in turn used to form the next state s_{t+1} as follows:

S_{t+1} =
    ∑_{i=1}^{N} f_i^{x_i} − f_b^{x_b}    if b = argmin_a Q(s_t, a)
    Terminal                              if N = 5    (8)

Then, the system judges the agent's decision through a reward r_t provided by two different functions, for videos annotated as normal and abnormal, defined by Equations 9 and 10 respectively:

r_{V_n}(s_t, a_i, s_{t+1}) =
    +1               if Q(s_t, a_i) = 0
    −Q(s_t, a)       otherwise    (9)

r_{V_ab}(s_t, a_i, s_{t+1}) =
    +1                                         if Q(s_t, a_i) = E_c(s_t, a_i)
    +|1 − (Q(s_t, a_i) − E_c(s_t, a_i))|       if |Q(s_t, a_i) − E_c(s_t, a_i)| < T_h
    −|1 − (Q(s_t, a_i) − E_c(s_t, a_i))|       if |Q(s_t, a_i) − E_c(s_t, a_i)| ≥ T_h    (10)

T_h is a predefined value, 2 < T_h < 5, representing the threshold used to signal an anomaly. The goal is reached once the anomaly is located in an abnormal video, or once all segments of a normal video are correctly judged with Q = 0.
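Because the inequality signs in Equation 10 were lost in typesetting, the following sketch encodes one plausible reading of it: the agent is rewarded while its disagreement with the expert stays below the threshold T_h and penalized once it reaches it. The function names and the scalar interface are illustrative assumptions.

```python
def reward_normal(q_sa):
    """Reward for a video annotated as normal (Eq. 9)."""
    return 1.0 if q_sa == 0 else -q_sa


def reward_abnormal(q_sa, expert_sa, threshold):
    """Reward for a video annotated as abnormal (Eq. 10), under an assumed
    reading of its threshold conditions.

    q_sa:      the agent's anomaly value Q(s_t, a_i) for the removed segment
    expert_sa: the expert's evaluation E_c(s_t, a_i) of the same segment
    threshold: T_h, the predefined anomaly-signalling threshold (2 < T_h < 5)
    """
    if q_sa == expert_sa:
        return 1.0
    magnitude = abs(1.0 - (q_sa - expert_sa))
    # Assumption: reward while the agent/expert disagreement is below T_h,
    # penalize once it reaches or exceeds T_h.
    if abs(q_sa - expert_sa) < threshold:
        return magnitude
    return -magnitude
```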
3.3 Prioritized experience replay

Prioritized experience replay is an improvement [18] of the experience replay mechanism used in the DQN algorithm that outperformed humans in Atari games [15]. Basic experience replay samples the batch uniformly (selecting experiences at random for training), so relevant experiences that occur rarely have practically no chance of being selected. As the name suggests, in prioritized experience replay a buffer is created to store the transition tuples, and the sampling distribution is changed according to a criterion that defines the priority of each experience tuple. The replay buffer is a cache D of finite size that feeds the neural network. Each transition (s_t, a_t, r_t, s_{t+1}) is stored with a priority related to its temporal-difference (TD) error: the highest priority is given to samples that produced a larger TD error, plus a small constant to avoid a zero probability of an experience being chosen.
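The paper does not give its replay-buffer implementation; the sketch below shows the proportional variant of prioritized experience replay [18] that the description above corresponds to, with an assumed capacity and priority exponent, and without the importance-sampling corrections of [18].

```python
import numpy as np

class PrioritizedReplayBuffer:
    """Proportional prioritized experience replay [18], minimal sketch."""

    def __init__(self, capacity=10000, alpha=0.6, eps=1e-6):
        self.capacity = capacity   # finite size of the cache D (assumed value)
        self.alpha = alpha         # how strongly priorities bias sampling (assumed value)
        self.eps = eps             # small constant avoiding zero selection probability
        self.transitions = []
        self.priorities = []

    def add(self, transition, td_error):
        """Store a (s_t, a_t, r_t, s_t+1) tuple with priority (|TD error| + eps)^alpha."""
        if len(self.transitions) >= self.capacity:   # drop the oldest entry when full
            self.transitions.pop(0)
            self.priorities.pop(0)
        self.transitions.append(transition)
        self.priorities.append((abs(td_error) + self.eps) ** self.alpha)

    def sample(self, batch_size):
        """Sample a batch with probability proportional to the stored priorities."""
        probs = np.asarray(self.priorities)
        probs = probs / probs.sum()
        idx = np.random.choice(len(self.transitions), size=batch_size, p=probs)
        return [self.transitions[i] for i in idx]
```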
4 Experimental results

Unlike the dataset used in [12], which is based on GPS trajectories, we chose to train our system on a dataset including multiple anomaly classes similar to real-world anomalies, in order to get as close as possible to the context of surveillance videos. We therefore perform experiments on a large-scale dataset named the UCF-Anomaly-Detection-Dataset [11] to evaluate the performance of our DRL anomaly event detector agent. The dataset is composed of long untrimmed surveillance videos covering 13 real-world anomalies: Abuse, Arrest, Arson, Assault, Accident, Burglary, Explosion, Fighting, Robbery, Shooting, Stealing, Shoplifting, and Vandalism. The UCF-Anomaly-Detection-Dataset contains 950 unedited real-world surveillance videos with clear anomalies as well as 950 normal videos. The UCF dataset provides only video-level annotations; however, to train the system, we need segment-level labels. For that purpose, we used an external expert system for video segment evaluation. In the remainder of this section, we describe the methods and the setup used for configuring and evaluating the learning video surveillance agent and report the details of the experimental results. Additionally, we compare our approach with state-of-the-art video anomaly detection methods.

4.1 Hyper-parameters

In our experiments, the video data are preprocessed by extracting visual features from the fully connected layer FC6 of the C3D network provided by the authors of [17]. Before computing features, we resize each video frame to 240x320 pixels and fix the frame rate to 30 fps. We compute C3D features for every 16-frame video clip, followed by l2 normalization. The agent's network is implemented as a fully connected feedforward neural network with the configuration described in Table 2. The network includes 3 layers: the ReLU function [19] is used for the first two layers, and the output layer uses the sigmoid activation function to build the output.

| Layer | 1st layer | 2nd layer | Output layer |
|---|---|---|---|
| Type | Dense | Dense | Dense |
| Unit size | 512 | 256 | 64 |
| Activation | ReLU | ReLU | Sigmoid |
| Weight regularizer | decay l2(0.001) | decay l2(0.001) | decay l2(0.001) |

Table 2: The network configuration.

After splitting every video into 32 non-overlapping segments, the agent starts learning by playing one episode per video. During an episode, the agent is allowed to play a number of steps to find the segments that cover the anomaly to be reported. The number of steps depends on the annotation of the processed video: if it is annotated as normal, the agent has 32 steps t to remove all segments, all of which should receive a low abnormality score since there is no anomaly. In the case of an episode on a video annotated as abnormal, 5 has been set as the minimum number of segments per state s_t, so the agent has 27 chances to remove the segments with the weakest anomaly scores; in other words, it keeps the segments with high predicted anomaly scores. We trained our agent over 1600 episodes. Training was performed on a 4GB NVIDIA GeForce RTX 2070 SUPER GPU. The methods are implemented in Python with the help of Keras. The training process uses the Adagrad optimizer [20] with the MSE loss function and a learning rate of 0.01; its remaining parameters are set to their defaults.
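As a sanity check, the configuration of Table 2 and the training settings above can be transcribed almost directly into Keras. The input dimensionality (the C3D FC6 feature size) and the way the 64 sigmoid outputs map onto segment scores are assumptions; the layer sizes, activations, regularizers, optimizer, loss and learning rate are those reported in this subsection.

```python
from tensorflow.keras import Sequential, layers, regularizers, optimizers

C3D_DIM = 4096   # FC6 feature size of the C3D extractor (assumed)

# Agent network as configured in Table 2: three dense layers with l2(0.001)
# weight decay, ReLU on the hidden layers and a sigmoid output layer.
agent_net = Sequential([
    layers.Dense(512, activation="relu",
                 kernel_regularizer=regularizers.l2(0.001),
                 input_shape=(C3D_DIM,)),
    layers.Dense(256, activation="relu",
                 kernel_regularizer=regularizers.l2(0.001)),
    layers.Dense(64, activation="sigmoid",
                 kernel_regularizer=regularizers.l2(0.001)),
])

# Training setup from Section 4.1: Adagrad with learning rate 0.01 and MSE loss;
# the remaining optimizer parameters stay at their Keras defaults.
agent_net.compile(optimizer=optimizers.Adagrad(learning_rate=0.01), loss="mse")
```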
4.2 Results and analysis

Figure 6: Examples of anomaly detection in abnormal videos.
Figure 7: Example of anomaly detection in normal videos.

Firstly, we study the sample results shown in Figures 6 and 7. Figure 6 shows successful abnormality detection on videos containing anomalous events: the localization of the anomalies is highlighted by red frames, corresponding to the highest predicted anomaly scores, while green frames highlight segments whose anomaly score approaches zero. Figure 7 shows successful detection on videos with only normal events; all frames in the video are shown in green, meaning that all segments of the video have a low anomaly score.

False alarms are considered the weak point of an artificial video surveillance system. Based on our observations, we made several attempts to reduce false anomaly detections in the system. During the first evaluation, we noticed a large number of false negative cases, and we deduced that this was due to the default value of the minimum number of final segments (equal to 5). We therefore rebuilt our model based on a new criterion: the final minimum number of segments is decided by the expert E_c. In other words, for a given video that contains an anomaly, the stop criterion of the episode is the number of segments whose anomaly value, as judged by the expert, is higher than a given threshold. Across many evaluations, the threshold value giving the best results was 3.2.

We also observed a high rate of false alarms in cases of sudden gatherings of people, as happened with our principal chosen expert E_c. To reduce this phenomenon, we defined this case as a very important experience through the prioritized experience replay mechanism, and we managed to reduce the error score by up to 60.1%. However, the system still failed in many cases of very crowded scenes.

The goal was to completely automate the video surveillance system. However, since some false alarms remain, we decided to set criteria for triggering the alarm (such as calling the police, locking all the doors, etc.): the predicted anomaly score should be greater than a threshold for the alarm to be triggered automatically. Otherwise, we propose to send the video segment to a human assistant to take the final decision.

4.3 Comparison with SOTA methods

Table 3 summarizes the comparison of the proposed approach with existing state-of-the-art methods on two different datasets. On the UCSD dataset, our approach shows inferior performance compared to the methods of Qiao et al. [7], Khaleghi et al. [8], Shean Chong et al. [9] and Hasan et al. [10]. On the other hand, the proposed system produces superior performance compared to machine learning-based methods such as Lu et al. [21].

| Approach | References | UCSD Dataset | UCF Dataset |
|---|---|---|---|
| Machine learning | [21] | 63.8% | 54.3% |
| Deep learning | [7] | 98.33% | - |
| | [8] | 88.1% | - |
| | [9] | 89.9% | - |
| | [10] | 90.0% | 65.5% |
| | [11] | - | 75.41% |
| Proposed system | | 87.44% | 83.12% |

Table 3: AUC comparison of the proposed system with SOTA baseline models on both the UCSD dataset and the UCF-Anomaly-Detection-Dataset, covering machine learning and deep learning methods.

As far as we know, deep learning is the dominant approach nowadays, owing to its high performance in many fields. The table shows that on the UCSD dataset the deep learning-based methods achieve better results than our approach. We attribute this to the volume of training data, which we consider insufficient as it includes only 50 video samples for training; we assume that this affects the agent's learning process.

On the UCF-Anomaly-Detection-Dataset, which is considerably larger than the previous one, our system outperforms both the deep learning and the machine learning techniques. We conclude that the proposed system requires a large amount of training data for optimal performance.

[11] provides a framework for detecting suspicious events in video surveillance by combining multiple instance learning with deep learning, resulting in an anomaly score for each video.

| Expert system E_c | Basic results | Proposed model results |
|---|---|---|
| Deep MIL [11] | 75.41% | 83.12% |
| Dictionary [21] | 54.3% | 65.2% |
| Deep auto-encoder [10] | 65.5% | 71.09% |

Table 4: AUC comparison of multiple methods used as the expert system.

Our agent also learned faster: in [11], the system started to predict the right anomaly scores after 3000 iterations, whereas in our system this happened after just 1450 iterations. This is due to the fact that we strengthened the learning phase with segment-level labels and increased the exploration time. We were also able to surpass the compared method [11] in the reduction of false alarms, as it generates a score of 1.9 for normal videos versus a score of 1.02 generated by our system. Additionally, we compared our anomaly detection approach to two other anomaly detection models, the dictionary-based approach [21] and the deep auto-encoder-based approach [10]; the settings used for the comparison of both models are exactly those set by [11]. Table 4 shows the comparison results for the frameworks of [11, 21, 10], each of which is simultaneously used as the anomaly detection expert E_c and as the provider of the video segment-level annotations.

5 Conclusion

In this paper, an automatic video surveillance system including anomaly detection based on a deep reinforcement learning technique is proposed. In order to accelerate the agent's learning process and achieve higher accuracy, this approach relies not only on the video-level annotations, but also on a video segment-level score provided by an expert system.

The system is trained on a variety of real-world anomalies to make it as efficient as possible in real-life situations. The described method achieves a very competitive performance that surpasses some of the expert performances.
Based on these results, we conclude that segment-level annotations would greatly increase the system's performance, particularly if the annotations were done by humans. We employed several reinforcement learning techniques, such as prioritized replay and the dueling architecture; there remain more recent improvements, such as the Rainbow model or NROWAN-DQN for network noise reduction.

References

[1] S. Ji, W. Xu, M. Yang, and K. Yu. 3D convolutional neural networks for human action recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2013. https://doi.org/10.1109/tpami.2012.59.

[2] K. Simonyan and A. Zisserman. Two-stream convolutional networks for action recognition in videos. NIPS, 2014.

[3] A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar, and L. Fei-Fei. Large-scale video classification with convolutional neural networks. CVPR, 2014. https://doi.org/10.1109/cvpr.2014.223.

[4] C. Schuldt, I. Laptev, and B. Caputo. Recognizing human actions: A local SVM approach. IEEE ICPR, 2004. https://doi.org/10.1109/icpr.2004.1334462.

[5] M. Ryoo and J. Aggarwal. Spatio-temporal relationship match: Video structure comparison for recognition of complex human activities. ICCV, pp. 1593–1600, 2009. https://doi.org/10.1109/iccv.2009.5459361.

[6] W. Hu, D. Xie, Z. Fu, W. Zeng, and S. Maybank. Semantic-based surveillance video retrieval. IEEE Transactions on Image Processing, vol. 16, no. 4, pp. 1168–1181, 2007. https://doi.org/10.1109/tip.2006.891352.

[7] Meina Qiao, Tian Wang, Jiakun Li, Ce Li, Zhiwei Lin, and Hichem Snoussi. Abnormal event detection based on deep autoencoder fusing optical flow. 36th Chinese Control Conference (CCC), IEEE, pages 11098–11103, 2017. https://doi.org/10.23919/chicc.2017.8029129.

[8] Ali Khaleghi and Mohammad Shahram Moin. Improved anomaly detection in surveillance videos based on a deep learning method. 2018 8th Conference of AI & Robotics and 10th RoboCup Iranopen International Symposium (IRANOPEN), IEEE, pages 73–81, 2018. https://doi.org/10.1109/rios.2018.8406634.

[9] Yong Shean Chong and Yong Haur Tay. Abnormal event detection in videos using spatiotemporal autoencoder. International Symposium on Neural Networks, Springer, pages 189–196, 2017. https://doi.org/10.1109/ascc.2015.7244871.

[10] M. Hasan, J. Choi, J. Neumann, A. K. Roy-Chowdhury, and L. S. Davis. Learning temporal regularity in video sequences. CVPR, 2016. https://doi.org/10.1109/cvpr.2016.86.

[11] Waqas Sultani, Chen Chen, and Mubarak Shah. Real-world anomaly detection in surveillance videos. Center for Research in Computer Vision (CRCV), 2018. https://doi.org/10.1109/cvpr.2018.00678.

[12] Min-hwan Oh and Garud Iyengar. Sequential anomaly detection using inverse reinforcement learning. arXiv:2004.10398v1 [cs.LG], 2020.

[13] Ziyu Wang, Tom Schaul, Matteo Hessel, Hado van Hasselt, Marc Lanctot, and Nando de Freitas. Dueling network architectures for deep reinforcement learning. arXiv:1511.06581v3 [cs.LG], 2016.

[14] V. Mnih, K. Kavukcuoglu, D. Silver, A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015. https://doi.org/10.1038/nature14236.

[15] V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, and M. Riedmiller. Playing Atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602, 2013.

[16] X. Lan, H. Wang, S. Gong, and X. Zhu. Deep reinforcement learning attention selection for person re-identification. BMVC, 2017. https://doi.org/10.5244/c.31.121.
[17] D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri. Learning spatiotemporal features with 3D convolutional networks. ICCV, 2015. https://doi.org/10.1109/iccv.2015.510.

[18] Tom Schaul, John Quan, Ioannis Antonoglou, and David Silver. Prioritized experience replay. arXiv:1511.05952v4 [cs.LG], 2016.

[19] Xavier Glorot, Antoine Bordes, and Yoshua Bengio. Deep sparse rectifier neural networks. Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, PMLR 15:315–323, 2011.

[20] J. Duchi, E. Hazan, and Y. Singer. Adaptive subgradient methods for online learning and stochastic optimization. J. Mach. Learn. Res., 2011.

[21] C. Lu, J. Shi, and J. Jia. Abnormal event detection at 150 fps in Matlab. ICCV, 2013. https://doi.org/10.1109/iccv.2013.338.