ERK'2022, Portorož, 359-362 359 Computational analysis of rhythmic data using RDA Arthur Vestu 1 , Lily-Jade Roldao 1 , Miha Moˇ skon 2 1 ESIGELEC Graduate School of Engineering, Rouen, France 2 Faculty of Computer and Information Science, University of Ljubljana, Ljubljana, Slovenia E-mail: miha.moskon@fri.uni-lj.si Abstract Rhythmic processes can be found in different contexts that range from biological to socio-technical systems. Several computational methods have been introduced to study such processes. However, these have mostly been adapted to work well with specific data and need to be manually adapted for a wider usage. We describe a software frame- work dedicated to a comprehensive analysis of rhythmic datasets. It integrates different state-of-the-art methods dedicated to the identification and characterisation of rhy- thmic processes. It allows its users to straightforwardly apply different methods to a selected dataset, and to iden- tify the method yielding the results with the largest rele- vance in a given context. We demonstrate the application of the proposed framework on two examples. Firstly, we report the results of a benchmarking experiment, which also indicate the classification performance of each of the implemented methods (i.e., classifying measurements among rhythmic and non-rhythmic groups). Secondly, we present the application of the proposed framework on the assessment of rhythmic trends in traffic data reflect- ing circadian rhythmicity. We believe that the proposed package will find a vast scope of applications in different scientific domains. 1 Introduction Rhythmic processes, i.e., processes that reflect periodic response, are pervasive in our environment. For exam- ple, circadian clocks present biological clocks that dis- play daily (approximately 24-hour period) oscillations and regulate up to half of all genes in an organism [1]. Since a disruption of these rhythms can lead to the develop- ment of different diseases, their analysis has gained sig- nificant importance in recent years [2]. In this context researchers are combining different experimental tech- niques with computational approaches [3]. Several com- putational methods for the identification and characteri- sation of rhythmic trends have been proposed in recent years [2]. These methods have been mostly adapted to work well with specific biological data, i.e. transcrip- tomic circadian time-series datasets. However, detection and analysis of rhythmic patterns have become an impor- tant aspect also in fields of research outside biology and medicine, such as urban planning. On the other hand, the majority of the state-of-the-art methods for rhythmic- ity detection and analysis cannot be straightforwardly ap- plied to such datasets. In this work we describe a computational framework for domain-agnostic analysis of rhythmic datasets. The framework combines different methods for rhythmicity detection and analysis. It accepts input data in standard- ised formats, and is able to produce publication-ready fig- ures and results in a tabular format for straightforward analysis. Moreover, the framework implements a set of functionalities which can be used for benchmarking ex- periments. We demonstrate the proposed framework and the methods it incorporates on two case studies. Firstly, we describe a benchmarking experiment of synthetically generated data. Secondly, we describe the application of the framework on the real data presenting the average car- travel paces throughout the day on a selected road seg- ment in the city of Ljubljana. We use the obtained results to discuss the pros and cons of each of the selected meth- ods. 2 Methods 2.1 Identification and characterisation of rhythmic data Several computational approaches to identify and charac- terise rhythmic data have been introduced in recent years [2]. In this section we describe some of the most popu- lar methods, which have also been incorporated into the proposed computational framework. 2.1.1 Lomb-Scargle Lomb-Scargle periodogram (LS) is one of the first algo- rithms devoted to the identification of rhythmicity in data. It presents a parametric model that identifies oscillations by comparing the data to sinusoidal curves [4]. LS can handle data quality issues like replicates, missing values or uneven sampling. 2.1.2 Cosinor Cosinor is a trigonometric regression model similarly to LS. Cosinor is a parametric model which allow us to pre- cisely identify oscillatory datasets as well as amplitudes and acrophases of oscillations. Cosinor can produce rel- evant results even when data quality is poor (irregular in- tervals, unbalanced data full of outliers) [5]. 360 2.1.3 ARSER ARSER is similar to Cosinor as it is also a parametric method using harmonic regression [6]. ARSER addition- ally uses autoregressive spectral estimation to estimate the period of the data. Nevertheless, it is sensitive to data quality issues which can lead to inaccuracies. Moreover, it does not work when there are replicates of data. 2.1.4 JTK CYCLE JTK CYCLE is a non-parametric method that detects os- cillations by comparing the ranks of the measured val- ues to a set of specified symmetric reference curves [7]. JTK CYCLE works well with replicates and missing val- ues. However, uneven sampling can lead to inaccurate acrophase estimations. 2.1.5 RAIN RAIN (Rhythmicity Analysis Incorporating Nonparamet- ric methods) is an upgraded version of JTK CYCLE us- ing asymmetric waveforms [8]. RAIN examines the in- creasing and decreasing portions of the curve separately. It is more tolerant to data quality issues than JTK CYCLE. 2.1.6 meta2d Metacycle presents a platform that can be used to merge the results of different methods. It applies the function meta2d [9], which is based on Fisher’s combined proba- bility test to combine the results of ARSER, JTK CYCLE and/or LS. 2.2 Benchmarking of methods for rhythmicity anal- ysis One of the problems of benchmarking of methods on real data is that the ground truth, e.g., does a specific dataset reflect oscillatory response or not, is usually not known. This problem can be solved with the application of syn- thetic data. Synthetic data presenting rhythmic as well as arrhythmic time-series can be created with samples from signals with user-defined parameters (e.g., periods and amplitudes), and with the addition of Gaussian noise to recreate the natural as well as technical variance [3]. However, there are certain guidelines that should be fol- lowed when collecting/generating data for experiments involving rhythmicity analysis [3]. For example, the data should be collected/generated for at least two periods to reduce the sensitivity of computational methods to out- liers (number of false negatives). 2.3 A computational framework for domain-agnostic analysis of rhythmic data We present a Python package RDA (Rhythmic Data Anal- ysis) with the implementation of different functions that allow the user to perform rhythmic data analysis using LS, ARSER, JTK CYCLE, Cosinor, RAIN, and meta2d. The package can accept the data in generalised input for- mats suitable for an arbitrary scientific domain, and can produce different types of visualisation of the obtained results, namely p-value distributions, Venn diagrams, and classification performance measures (when the target la- bels are known). The RDA implementation, documenta- tion and examples are available athttps://github. com/VESTUArthur/RDA. 3 Results 3.1 Case study 1: benchmarking of methods on syn- thetic data We generated the synthetic data using the functionalities of the CosinorPy package [10] with different noise lev- els relative to the oscillation amplitudes (0.3, 0.6, 0.9) and different number of cosinor components (1, 2 or 3). The data were generated in a 48 hour interval with a 2 hour sampling resolution. We did not use replicates be- cause the ARSER does not work with replicated data. For each configuration we generated 5,000 rhythmic and 5,000 non-rhythmic time-series data, and each configura- tion was repeated 6 times. The period of rhythmic data was set to 24 hours. We tested these datasets using differ- ent methods and we evaluated their performance through various metrics, such as Matthew’s correlation coefficient (MCC) [11], area under curve (AUC), precision, recall, f1-score and accuracy (see Figure 1). Figure 2 presents the results of MCC assessment for all the experiments. method method method method method method f1 precision recall accuracy auc mcc Figure 1: Evaluation of methods on synthetic data generated us- ing a single-component cosinor model and a noise levels of 0.3. The experiment was repeated 6 times and error bars represent standard error of each metric for each model. Abbreviations and symbols: ARS – ARSER, JTK – JTK CYLCE, LS – Lomb- Scargle, meta2d – Metacycle, Cosinor – zero-amplitude test us- ing a multi-component cosinor, Cosinor1 – zero-amplitude us- ing a single-component cosinor, Cosinor1(q) – model signifi- cance test using a single-component cosinor, RAIN – Rhyth- micity Analysis Incorporating Nonparametric methods. 361 mcc mcc mcc mcc mcc mcc mcc mcc mcc 1 component, noise = 0.3 1 component, noise = 0.6 1 component, noise = 0.9 2 components, noise = 0.3 2 components, noise = 0.6 2 components, noise = 0.9 3 components, noise = 0.3 3 components, noise = 0.6 3 components, noise = 0.9 Figure 2: Matthew’s Correlation Coefficient (MCC) for each method evaluated on data generated with different number of harmonic component and noise levels. Each experiment was repeated 6 times and error bars represent standard error of each metric for each model. Abbreviations and symbols: ARS – ARSER, JTK – JTK CYLCE, LS – Lomb-Scargle, meta2d – Metacycle, Cosinor – zero-amplitude test using a multi-component cosinor, Cosinor1 – zero-amplitude using a single-component cosinor, Cosinor1(q) – model significance test using a single-component cosinor, RAIN – Rhythmicity Analysis Incorporating Nonparametric methods. 3.2 Case study 2: application of the framework on traffic data In our second case study we applied the proposed frame- work to the evaluation of rhythmic trends in traffic data. We performed the analysis of data obtained on different road segments using Google Directions API [12] as de- scribed in [13]. Briefly, travel times on a segment were obtained using a 10 minutes sampling resolution. Route lengths and travel times were then converted to average paces for each route and for each time sampled. In the analysis we employed the same methods as in the first case study. We tested the capability of each method to assess the rhythmicity parameters that can be used for in- terpretation, namely locations of each peak and MESOR (midline estimating statistic of rhythm) values. Each met- hod returns different parameters, which can be interpreted in a similar way (see Table 1). We had problems using RAIN due to the large amount of data which caused the memory exceeded error. More- over, to test the data on ARSER it was necessary to av- erage the replicates on each timepoint, and fill missing values. For each method, we plotted the peaks obtained. Nevertheless, ARSER peak was not plotted since it does not yield the peak occurrence. Figure 3 indicates the consistence between the observed data and the assessed peaks. We can see that Cosinor yields the most accurate lo- cations of peaks. Moreover, a multi-component cosinor model is the only method, that is able to assess the loca- Table 1: Interpretation of rhythmicity parameters as returned by the implementation of each method on the selected datset. Ab- breviations and symbols: ARS – ARSER, JTK – JTK CYLCE, LS – Lomb-Scargle, Cosinor – multi-component cosinor, Cosi- nor1 – single-component cosinor, N/A – not available, t(peak) – occurrence of peak(s), height(peak) – height of peak(s), MESOR – midline estimating statistic of rhythm. Method t(peak) height(peak) MESOR ARS N/A amplitude mean JTK LAG AMP mean LS PeakSPD PhaseShiftHeight N/A Cosinor peak heights mean Cosinor1 acrophase[h] amplitude mean tions of multiple peaks per one rhythmicity period. Cosi- nor also yields the most user-friendly output with exact locations of peaks and their heights. 4 Conclusions In this paper we proposed a framework for domain-agnostic analysis of rhythmic data. We benchmarked the methods incorporated within the proposed framework using syn- thetic data. Moreover, we demonstrated the applicability of the framework on the traffic data reflecting circadian trends with two distinct peaks per period. Each of the methods applied in our analysis exhib- 362 Figure 3: The consistence between the observed data and as- sessed peaks for each of the selected methods. Black dots rep- resent the locations of assessed peaks. The analysis was per- formed on route 0 – a more detailed description of the data and the location of this route is available at [13]. ited certain advantages in comparison to other methods. For example, Cosinor can be used to accurately evalu- ate the rhythmicity parameters, but is in some cases less successful in classification than other (non-parametric) methods (see Figure 2). On the other hand, RAIN and JTK CYCLE were specifically developed for hypothe- sis testing but yield inconsistent or biased estimations of rhythmicity parameters [8]. We also tested recently de- veloped PyBOAT method [14], which can be used to ac- curately assess rhythmicity parameters, but failed on both of our case studies. The period obtained with this method was between 400 and 1000 minutes while the true pe- riod of the data was 1440 minutes (i.e., 24 hours) in both cases. When performing the analyses one must identify its goals and select the most suitable method accordingly. Our recommendation is that different methods are used together to take advantages of each method separately and thus obtain the results with higher significance. The proposed framework allows the researcher to do this stra- ightforwardly. Our future work will be focused to more detailed ben- chmarking of the applied methods as well as to the appli- cation of the proposed framework on other types of rhyth- mic data. Moreover, our work has been mostly focused to the identification and characterisation of individual time- series data. In the near future we will extend the frame- work and its benchmarking with comparative analyses of rhythms from different fields of science. Additionally, we will extend the framework with implementations of addi- tional state-of-the-art methods. Acknowledgements This work has been partially supported by the scientific research program P2-0359, and by the basic research projects J5-1798, both financed by the Slovenian Research Agency. A V and LJR were supported by the Erasmus+ exchange programme of the European Union. The funding bodies had no role in the design of the study and collection, analysis, and interpretation of data nor in writing the manuscript. References [1] R. Zhang, N. Lahens, H. Ballance, E. Hughes, and J. Ho- genesch, “A circadian gene expression atlas in mam- mals: implications for biology and medicine.,” Proceed- ings of the National Academy of Sciences, vol. 111, no. 45, pp. 16219–16224, 2014. [2] M. Wenwen, J. Zhiwen, C. Yang, C. Li, S. Aziz, and J. Yuchao, “Genome-wide circadian rhythm detection methods: systematic evaluations and practical guide- lines.,” Briefings in Bioinformatics, vol. 22, no. bbaa135, 2021. [3] M. Hughes, K. Abruzzi, R. Allada, R. Anafi, A. Arpat, G. Asher, P. Baldi, C. de Bekker, D. Bell-Pedersen, J. Blau, S. Brown, M. Ceriani, Z. Chen, J. Chiu, J. Cox, A. Crowell, J. DeBruyne, D. Dijk, L. DiTacchio, and J. Hogenesch, “Guidelines for genome-scale analysis of biological rhythms,” Journal of biological rhythms, pp. 380–393, 2017. [4] J. T. VanderPlas, “Understanding the lomb–scargle peri- odogram,” The Astrophysical Journal Supplement Series, vol. 236, no. 1, p. 16, 2018. [5] G. Cornelissen, “Cosinor-based rhythmometry,” Theoreti- cal Biology and Medical Modelling, vol. 11, no. 1, pp. 1– 24, 2014. [6] R. Yang and Z. Su, “Analyzing circadian expression data by harmonic regression based on autoregressive spectral estimation,” Bioinformatics, vol. 26, no. 12, pp. i168– i174, 2010. [7] M. E. Hughes, J. B. Hogenesch, and K. Kornacker, “JTK CYCLE: an efficient nonparametric algorithm for detecting rhythmic components in genome-scale data sets,” Journal of biological rhythms, vol. 25, no. 5, pp. 372–380, 2010. [8] P. F. Thaben and P. O. Westermark, “Detecting rhythms in time series with rain,” Journal of biological rhythms, vol. 29, no. 6, pp. 391–400, 2014. [9] G. Wu, R. C. Anafi, M. E. Hughes, K. Kornacker, and J. B. Hogenesch, “Metacycle: an integrated r package to evaluate periodicity in large scale data.,” Bioinformatics, vol. 32(21), no. 3351–3353, 2016. [10] M. Moˇ skon, “CosinorPy: a Python package for cosinor- based rhythmometry,” BMC bioinformatics, vol. 21, no. 1, pp. 1–12, 2020. [11] D. Chicco and G. Jurman, “The advantages of the matthews correlation coefficient (mcc) over f1 score and accuracy in binary classification evaluation,” BMC ge- nomics, vol. 21, no. 1, pp. 1–13, 2020. [12] Google Inc., “The Directions API overview.” https://developers.google.com/maps/ documentation/directions/overview, 2022. [13] ˇ Spela Verovˇ sek, M. Juvanˇ ciˇ c, S. Petrovˇ ciˇ c, T. Zupanˇ ciˇ c, M. Svetina, M. Janeˇ z, ˇ Ziga Puˇ snik, N. Velikajne, and M. Moˇ skon, “An integrative approach to neighbourhood sustainability assessments using publicly available traf- fic data,” Computers, Environment and Urban Systems, vol. 95, p. 101805, 2022. [14] C. Schmal, G. M¨ onke, and A. E. Granada, “Analysis of complex circadian time series data using wavelets,” in Cir- cadian Regulation, pp. 35–54, Springer, 2022.