ERK'2020, Portorož, 367-370

A review and comparison of time series similarity measures

Maša Kljun¹, Matija Teršek¹, Erik Štrumbelj¹
¹ Faculty of Computer and Information Science, University of Ljubljana
E-mail: mk2700@student.uni-lj.si, mt2421@student.uni-lj.si, erik.strumbelj@fri.uni-lj.si

Abstract

We review 12 time series similarity measures and investigate their time complexity, normalization, invariance with respect to warping and scaling, support of time series of different lengths, and other properties. We show on simulated data that several similarity measures perform well on average, but none perform well in all cases, and in some cases measures that typically perform poorly, such as compression-based similarity, are a better alternative.

1 Introduction

Measuring similarity between time series is an important component of time series data analysis, especially unsupervised learning. Many different measures exist, and it is often not clear which measure is the best choice for the task at hand or how different measures compare with respect to relevant properties such as invariance to scaling/warping and time complexity.

The few related works are Wang et al. [13], who compare 9 measures but omit those based on correlation coefficients or compression. Serra and Arcos [10] and Górecki and Piasecki [6] compare 7 and 30 measures, respectively, but focus on 1-NN classification performance rather than on clustering performance and other properties as we do. Esling and Agon [4] do focus on other properties, but not on clustering performance.

We aim to provide a compact review and classification of the most commonly used similarity measures and relevant properties which are often excluded in related work. Furthermore, we use several simulated data sets to empirically evaluate how different measures compare to each other and how well they perform in clustering.
2 Distance measure features

In this paper we view time series similarity measures in terms of the following properties, which are relevant to choosing the best similarity measure for the task at hand:

Time complexity.

Can compare time series of different lengths.

Normalization. Does increasing the length or sampling frequency of the time series, without changing any other properties, change the value? If so, we provide a factor that normalizes the measure and facilitates comparison across time series of different lengths.

Invariance/robustness with respect to warping and scaling. Warping is a change of the time series' times that preserves the ordering. Scaling is multiplication of the time series' values by a constant. Related work is inconsistent about warping, so we additionally define weak invariance (the same change is applied to both compared time series) and strong invariance (the change is applied to only one of the time series). Strong invariance to warping implies weak invariance to warping.

A summary of similarity measures is in Table 1.

3 Distance measures

Let $X = x_1, \ldots, x_n$ and $Y = y_1, \ldots, y_m$ be the two time series whose similarity we are interested in. We also use $X_n$ and $X_1$ to represent $X$ without the last and first point, respectively.

3.1 $L_p$ norms/distances

Depending on the value of $p$, we have the Manhattan distance ($p = 1$), $\sum_{i=1}^{n} |x_i - y_i|$, and, in general, the Minkowski distance ($1 < p < \infty$), $\left(\sum_{i=1}^{n} |x_i - y_i|^p\right)^{1/p}$, with the Euclidean distance as the special case $p = 2$.

3.5 Edit distance with real penalty

The edit distance with real penalty (ERP) is defined as

$$E(X,Y) = \begin{cases} \sum_{i=1}^{m} |y_i - g| & ; n = 0 \\ \sum_{i=1}^{n} |x_i - g| & ; m = 0 \\ \min\{E(X_1,Y_1) + d(x_1,y_1),\; E(X_1,Y) + d(x_1,g),\; E(X,Y_1) + d(y_1,g)\} & ; \text{else} \end{cases} \quad (1)$$

with $d(x,y)$ defined as the $L_1$ norm between the points $x$ and $y$ [2]. If one of them is a gap point, it is equal to a user-defined constant $g$.

3.6 Longest common subsequence

We use the longest common subsequence (LCSS) model to cope with various problems such as different sampling rates, different lengths, outliers, and efficiency [12]. It allows for unmatched elements and efficient approximate calculation.
It is defined as

$$D_{\varepsilon,\delta}(X,Y) = 1 - \frac{L_{\varepsilon,\delta}(X,Y)}{\min(n,m)}, \quad (2)$$

where $L$ is a function defined as

$$L_{\varepsilon,\delta}(X,Y) = \begin{cases} 0 & ; A \\ 1 + L_{\varepsilon,\delta}(X_n, Y_m) & ; B \\ \max\{L_{\varepsilon,\delta}(X_n,Y),\; L_{\varepsilon,\delta}(X,Y_m)\} & ; \text{else} \end{cases} \quad (3)$$

where $A = \{n = 0 \lor m = 0\}$ and $B = \{|x_n - y_m| < \varepsilon \wedge |n - m| < \delta\}$.

3.7 TQuEST

This similarity measure starts by transforming each time series into disjoint intervals such that within every interval all the time series' values are above a threshold $\varepsilon$. Let $S(X, \varepsilon)$ be the unique such transformation of $X$ where the intervals are the largest. The TQuEST measure is defined as

$$T(X,Y) = \frac{1}{|S(X,\varepsilon)|} \sum_{s \in S(X,\varepsilon)} \min_{s' \in S(Y,\varepsilon)} d(s,s') + \frac{1}{|S(Y,\varepsilon)|} \sum_{s' \in S(Y,\varepsilon)} \min_{s \in S(X,\varepsilon)} d(s',s), \quad (4)$$

where $d(a,b)$ is the Euclidean distance between two time intervals [1].

3.8 Cross-correlation

Cross-correlation (CCor) is based on the Pearson correlation coefficient [7]. It is defined as

$$d_{CC}(X,Y) = \sqrt{\frac{1 - CC_0(X,Y)}{\sum_{k=1}^{ml} CC_k(X,Y)}},$$

where $CC_k$ is the lag-$k$ covariance and $ml$ is the maximum allowed lag between $X$ and $Y$, which should not exceed the length of the series. By default, it is $\min\{n,m\} - 1$ [7].

3.9 Compression-based similarity measure

The compression-based similarity measure (CDM) is a class of measures defined as $CDM(X,Y) = \frac{C(XY)}{C(X) + C(Y)}$, where $C(X)$ is the size in bytes of the compressed time series $X$ and $XY$ stands for the concatenation of time series $X$ and $Y$. Any compression algorithm can be used for $C(\cdot)$. In the empirical evaluation, we use gzip.

3.10 Piccolo distance

This similarity measure was introduced by Piccolo [9] as a measure of similarity between two ARIMA processes. It is defined as the Euclidean distance between the coefficients of the series' $AR(\infty)$ formulations. The coefficients of the lower-order series are padded with zeros to the length of the higher-order one. The Piccolo distance exists for any invertible ARIMA process [8].

3.11 Prediction-based distance

This is a class of similarity measures based on the idea that two time series are similar if they are close at a specific time point in the future. Vilar et al. [11] is an implementation of this idea where forecast densities at a specific point in the future, $T + h$, are compared. The distance is then calculated as the integral of the absolute difference between the estimated forecast densities of time series $X$ and $Y$ at time $T + h$ [8]. We set the forecast horizon $h$ to 1 in our empirical evaluation.

3.12 Embedding-based similarity

This class of measures is based on learning a vector representation of the time series and then computing their similarity using a vector similarity measure, such as Euclidean distance. We implemented this idea using Euclidean distance and Random Warping Series (RWS) to create a vector representation of the time series [14]. This method uses the DTW between the given time series and a distribution of random time series. A family of positive definite kernels can then be constructed from the map given by DTW. Optimal parameter values are found using cross-validation. The time complexity depends on the model used but is typically $O(n)$, excluding the time required to learn the representation. It is normalized and weakly invariant to warping, but it is not invariant to scaling. In the empirical evaluation, we use an 8-dimensional representation (16- and 32-dimensional representations do not lead to better results).

4 Classification of similarity measures

Wang et al. [13] propose the following categories: lock-step, elastic, threshold-based, and pattern-based measures. Montero et al. [8] propose: complexity-based, prediction-based, model-free, and model-based measures. Esling and Agon [4] propose: shape-based, edit-based, feature-based, and structure-based measures. Each of these classifications provides a different but incomplete view of similarity measures. A reconciliation is beyond the scope of this paper, but for the sake of completeness we classify the measures used in this paper according to each of the three classifications (see Table 1).
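To make some of the definitions above concrete, the following Python sketch gives minimal reference implementations of three of the reviewed measures: the lock-step $L_p$ distance, the $O(nm)$ dynamic-programming DTW, and CDM. This is our own illustrative code, not the implementation used in our experiments; in particular, zlib stands in for gzip as the compressor, and the textual encoding of the values for CDM is an arbitrary choice.

```python
import zlib

def lp_distance(x, y, p=2):
    """Lock-step L_p distance; requires equal-length series
    (Table 1: no support for different lengths)."""
    assert len(x) == len(y)
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)

def dtw(x, y):
    """Classic O(nm) dynamic-programming DTW with an L1 point cost.
    Handles series of different lengths and is invariant to warping."""
    n, m = len(x), len(y)
    INF = float("inf")
    D = [[INF] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(x[i - 1] - y[j - 1])
            D[i][j] = cost + min(D[i - 1][j],      # insertion
                                 D[i][j - 1],      # deletion
                                 D[i - 1][j - 1])  # match
    return D[n][m]

def cdm(x, y):
    """CDM(X, Y) = C(XY) / (C(X) + C(Y)); zlib is used here in
    place of gzip, and the value encoding is illustrative."""
    bx = ",".join(f"{v:.4f}" for v in x).encode()
    by = ",".join(f"{v:.4f}" for v in y).encode()
    c = lambda b: len(zlib.compress(b))
    return c(bx + by) / (c(bx) + c(by))
```

As a sanity check, DTW aligns a warped copy at zero cost, e.g. dtw([1, 2, 3], [1, 2, 2, 3]) = 0, whereas the lock-step $L_p$ distance is not even defined for that pair of lengths.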
5 Empirical evaluation and comparison

We generated 9 groups of 100 time series of length 100 (see Table 2). For each pair of groups, we computed the similarity for each pair of time series. We then clustered them into 2 clusters with k-medoids clustering and evaluated the clustering using the adjusted Rand index. We repeated this process for each similarity measure. The purpose of this experiment was twofold: first, to identify which measures give similar values (see Figure 1), and second, to highlight scenarios where a similarity measure might fail and which alternative could be used (see Table 3 for a summary).

6 Discussion

While we did not cover all similarity measures, we did cover the most popular ones and at least one representative of each class, except Spatial Assembling Distance (SpADe). Several similarity measures perform well on average, but none perform well in all cases. In some cases, less known measures such as compression-based similarity are better, even though they typically perform poorly. Therefore, choosing the best similarity measure for the task at hand is not trivial, and there is value in our review of their properties. The Piccolo distance stands out as the only linear-complexity similarity measure with good overall performance. Our embedding-based approach achieves similar performance and also has linear time complexity, excluding the time required to learn the embedding. However, it was learned on labelled data and is not generally applicable. The results are consistent with the results of Wang et al.
[13], Serra and Arcos [10], and Górecki and Piasecki [6], where DTW and edit distance measures performed best, although Serra and Arcos [10] and Górecki and Piasecki [6] report that advanced modifications of DTW outperformed other measures.

Figure 1: Pearson correlation between similarity measures across all pairs of time series from all nine groups. DTW, Euclidean distance, DISSIM, EDR, ERP, LCSS, and RWS all give numerically similar results. The following are particularly similar: Euclidean distance and DISSIM; DTW, ERP, and RWS; and EDR and LCSS. CCor, PIC, and PRED are somewhat similar to each other and to the other methods. TQuEST and CDM strongly differ from the other measures and from each other.

Further investigation of embedding-based approaches is our main direction for future work, in particular embeddings that are based on unlabelled data. While these approaches have been tremendously successful in image and video analysis and there have been notable applications in time series, there is no systematic treatment in the context of time series similarity and clustering. Finally, the classification of time series similarities requires further work to reconcile the differences and inconsistencies between existing classifications and produce a more generally applicable classification.

References

[1] Aßfalg, J., Kriegel, H.-P., Kröger, P., Kunath, P., Pryakhin, A., and Renz, M. (2006). TQuEST: threshold query execution for large sets of time series.
In International Conference on Extending Database Technology, pages 1147–1150. Springer.

[2] Chen, L. and Ng, R. (2004). On the marriage of Lp-norms and edit distance. In Proceedings of the Thirtieth International Conference on Very Large Data Bases - Volume 30, pages 792–803.

[3] Chen, L., Özsu, M. T., and Oria, V. (2005). Robust and fast similarity search for moving object trajectories. In Proceedings of the 2005 ACM SIGMOD International Conference on Management of Data, pages 491–502.

[4] Esling, P. and Agon, C. (2012). Time-series data mining. ACM Computing Surveys (CSUR), 45(1):1–34.

[5] Giorgino, T. et al. (2009). Computing and visualizing dynamic time warping alignments in R: the dtw package. Journal of Statistical Software, 31(7):1–24.

[6] Górecki, T. and Piasecki, P. (2019). A comprehensive comparison of distance measures for time series classification. In Workshop on Stochastic Models, Statistics and their Application, pages 409–428. Springer.

[7] Liao, T. W. (2005). Clustering of time series data—a survey. Pattern Recognition, 38(11):1857–1874.

[8] Montero, P., Vilar, J. A., et al. (2014). TSclust: An R package for time series clustering. Journal of Statistical Software, 62(1):1–43.

[9] Piccolo, D. (1990). A distance measure for classifying ARIMA models. Journal of Time Series Analysis, 11(2):153–164.

[10] Serra, J. and Arcos, J. L. (2014). An empirical evaluation of similarity measures for time series classification. Knowledge-Based Systems, 67:305–314.

[11] Vilar, J. A., Alonso, A. M., and Vilar, J. M. (2010). Non-linear time series clustering based on non-parametric forecast densities. Computational Statistics & Data Analysis, 54(11):2850–2865.

[12] Vlachos, M., Kollios, G., and Gunopulos, D. (2002). Discovering similar multidimensional trajectories. In Proceedings 18th International Conference on Data Engineering, pages 673–684. IEEE.

[13] Wang, X., Mueen, A., Ding, H., Trajcevski, G., Scheuermann, P., and Keogh, E. (2013). Experimental comparison of representation methods and distance measures for time series data. Data Mining and Knowledge Discovery, 26(2):275–309.

[14] Wu, L., Yen, I. E.-H., Yi, J., Xu, F., Lei, Q., and Witbrock, M. (2018). Random warping series: A random features method for time-series embedding. arXiv preprint arXiv:1809.05259.

Measure | warping | scaling | Comp.          | normalization | Wang      | Esling    | Montero     | Diff. lengths
Lp      | strong  | no      | O(n)           | $\sqrt[p]{n}$ | lock-step | shape     | model-free  | no
DISSIM  | no      | no      | O(n+m)         | n-1           | lock-step | shape     | model-free  | no
DTW     | strong  | no      | O(nm)          | n; m; n+m     | elastic   | shape     | model-free  | yes
ERP     | weak†   | no      | O(nm)          | max(n,m)      | elastic   | edit      | model-free  | yes
EDR     | weak†   | no      | O(nm)          | max(n,m)      | elastic   | edit      | model-free  | yes
LCSS    | weak†   | no      | O(nm)          | normalized    | elastic   | edit      | model-free  | yes
TQuEST  | weak†   | no      | O(nm log nm)   | normalized    | threshold | feature   | model-free  | yes
CCor    | strong  | yes     | O(nm)          | $(\sqrt{ml})^{-1}$ | elastic | feature | model-free  | yes
PIC     | strong  | yes     | O(n+m)*        | max(k1,k2)    | /         | structure | model-based | yes
CDM     | weak†   | no      | O(n+m)         | normalized    | /         | structure | complexity  | yes
PRED    | no      | no      | O(n+m)*        | normalized    | /         | structure | prediction  | yes
RWS     | weak    | no      | O(n)*          | normalized    | /         | structure | model-based | no

Table 1: Our classifications are in italics. / denotes that Wang et al. [13] treated different representations separately from similarity measures, * denotes the most typical implementation, which may vary with the choice of underlying models, and † denotes invariance based on the original definitions, where time is considered, but this may vary with implementation.

group | type                     | noise
G1    | -                        | N
G2    | -                        | G
G3    | linear                   | N
G4    | linear                   | G
G5    | linear + varying slope   | N
G6    | sine                     | N
G7    | sine                     | G
G8    | sine + varying phase     | N
G9    | sine + varying amplitude | N

Table 2: A summary of the 9 groups of simulated time series. N is normally distributed noise ($\mu = 0$, $\sigma = 0.5$), G is gamma-distributed noise with parameters 0.5 and 3.

measure | mean | median | min
eucl    | 0.61 | 0.61   | 0.00
dissim  | 0.65 | 0.97   | 0.00
dtw     | 0.80 | 1.00   | -0.00
edr     | 0.80 | 1.00   | -0.00
erp     | 0.85 | 1.00   | -0.00
lcss    | 0.83 | 1.00   | -0.00
ccor    | 0.64 | 0.70   | -0.00
cdm     | 0.21 | 0.04   | -0.00
pic     | 0.85 | 1.00   | 0.01
pred    | 0.55 | 0.54   | -0.00
tquest  | 0.36 | 0.19   | -0.00
rws     | 0.87 | 1.00   | 0.25

Table 3: Aggregated adjusted Rand index. DTW, EDR, ERP, LCSS, CCor, PIC, and RWS are the best on average, while CDM and TQuEST perform poorly. However, all similarity measures have poor worst-case performance: for each there is at least one pair of groups where it performs no better than random. CDM is the best at distinguishing between linear time series and linear time series with varying slope, where all other similarity measures perform poorly. Most measures also poorly distinguish between G1-G2 (normal noise vs gamma noise), G1-G5, and G2-G5 (normal or gamma noise vs linear time series with varying slope). Complete results can be found at https://github.com/tersekmatija/ERK-2020.