https://doi.org/10.31449/inf.v49i17.6885 Informatica 49 (2025) 145–154 145 Reduced Convolutional Recurrent Neural Network Using MFCC for Music Genre Classification on the GTZAN Dataset Ela Setiorini, Moeljono Widjaja, Arya Wicaksana * Universitas Multimedia Nusantara, Tangerang 15810 Indonesia E-mail: ela.setiorini@student.umn.ac.id, moeljono.widjaja@umn.ac.id, arya.wicaksana@umn.ac.id *Corresponding author Keywords: classification, CRNN, GTZAN, MFCC, MIR, music genre Received: August 9, 2024 This study presents a reduced Convolutional Recurrent Neural Network (CRNN) model for music genre classification, leveraging the GTZAN dataset and Mel-Frequency Cepstral Coefficient (MFCC) feature extraction. Unlike more complex architectures, this model simplifies the CRNN structure to three convolutional layers and two BiLSTM layers, maintaining competitive performance while reducing computational complexity. Key experimental parameters included learning rate tuning (0.1, 0.01, 0.001, and 0.0001) and dropout usage (30% before the BiLSTM layers) to mitigate overfitting. The best configuration, utilizing a learning rate of 0.001 and dropout, achieved an accuracy of 88.64%, outperforming more complex CRNN models by approximately 15%. These results underscore the potential of streamlined architectures in music information retrieval tasks, particularly for applications where computational resources are constrained. Future work will address overfitting issues and refine the dataset for enhanced model performance. Povzetek: Študija predstavlja poenostavljen model konvolucijsko-rekurentne nevronske mreže (CRNN) za klasifikacijo glasbenih zvrsti z uporabo GTZAN podatkovne zbirke in MFCC značilk, ki kljub zmanjšani kompleksnosti dosega visoko natančnost (88,64 %) ter presega zmogljivejše modele za približno 15 %. 1 Introduction Music genre classification is crucial for information retrieval (Music Information Retrieval) and analysis, and as digital music libraries grow, automated methods are needed to categorize and organize music. Music Information Retrieval (MIR) is a research field that focuses on analyzing and extracting music transcription, beat detection, on-set detection, and genre classification [1]. Traditional methods often rely on handcrafted features and machine learning algorithms, which need help to capture complex temporal and spectral patterns. Convolutional Recurrent Neural Networks (CRNNs), a combination of CNNs for feature extraction and RNNs for temporal dependencies, have shown promising results in extracting hierarchical features from raw audio data [2]. This work investigates further the CRNN application for MIR, specifically to classify music genres with the GTZAN dataset. The CRNN extracts local features and aggregates temporal patterns [3]. MFCC is extracted from the dataset to become an input for the model [4]. The combination of CRNN in this work is CNN with BiLSTM (Bidirectional Long Short-Term Memory). This work proposes a simpler version of the algorithms compared to Ashraf et al. [5], with only three layers of CNN and two layers of RNN (BiLSTM-BiLSTM). The accuracy of Ashraf et al. is 73.69% with five layers of CNN and three layers of RNN. The main contribution of this work is a less complex CRNN model architecture with higher accuracy. The rest of this paper is structured as follows: The Introduction section sets the background and motivation for this paper. Related Works section layouts other works in MIR that utilize machine learning and deep learning methods. The Methods section explains the data collection, requirement specification, design and implementation, and testing and evaluation. The Results and Analysis section presents empirical evidence based on four scenarios and discusses the findings. Finally, the Conclusion section summarizes the findings and outlines future works. 2 Related works Ghosh et al. [2] compared machine learning models (SVC or Support Vector Classifier, logistic regression, and ensemble learning using AdaBoost) and deep learning models (ANN or Artificial Neural Network, CNN, CRNN with CNN-LSTM combination, and PCRNN or Parallel CRNN). Inputs used in this work are feature matrix (for machine learning models) and Mel-spectrogram (for deep learning models), extracted from the FMA dataset. The CRNN has the highest accuracy at 90%, with only 480 out of 8000 data used from the dataset. Ashraf et al. [5] use a couple of CRNN combinations, i.e., CNN-LSTM, CNN-BiLSTM, CNN-GRU, and CNN- BiGRU. Inputs used in this work are the Mel-spectrogram and Mel-Frequency Cepstral Coefficient (MFCC) extracted from the GTZAN dataset. This work's highest 146 Informatica 49 (2025) 145–154 E. Setiorini et al. accuracy is obtained by CNN-BiGRU using Mel- spectrogram (89.3%) and CNN-LSTM using MFCC (76.4%). Mendes et al. [6] compared two CRNN combinations (CNN-LSTM and CNN-BiLSTM), and the results were used to get music recommendations. The input that is used in this work is the Mel-spectrogram that is extracted from the FMA dataset. CNN-BiLSTM achieved the highest accuracy, with an accuracy of 72%. Luo [7] compared two deep learning algorithms, Convolutional Neural Network (CNN) and Long Short- Term Memory (LSTM), using GTZAN and FMA (Free Music Archive) datasets. The accuracies obtained for the GTZAN dataset are 56% (CNN) and 42% (LSTM). Meanwhile, accuracies for the FMA dataset are 50.5% (CNN) and 33.5% (LSTM). Kumar et al. [8] used CNN and GTZAN datasets to classify music genres. The accuracy obtained in this work is 83%. Ghosh et al. [2] compared several machine learning and deep learning algorithms using the FMA dataset. The highest accuracy obtained in this work was achieved using a convolutional- recurrent neural network (CRNN) with an accuracy of 90%. Table 1 provides a summary and comparison of these related works. Table 1: Test results from the model Work Model Dataset Feature Extraction Accuracy Luo [7] – “Automatic Music Genre Classification based on CNN and LSTM” - CNN - LSTM - GTZAN - FMA - Mel- spectrogram (for CNN) - MFCC (for LSTM) - Highest accuracy obtained by CNN using both GTZAN dataset (56%) - FMA dataset (50.5%) Kumar et al. [8] – “Automated Music Genre Classification through Deep Learning Techniques” CNN GTZAN MFCC Accuracy obtained by the model is 83% Ghosh et al. [2] – “A Study on Music Genre Classification using Machine Learning” - Machine learning models (SVC, logistic regression, and ensemble learning using AdaBoost) - Deep learning models (ANN, CNN, CRNN with CNN-LSTM combination, PCRNN) FMA, using only 480 out of 8000 data - Feature matrix (for machine learning models) - Mel- spectrogram (for deep learning models) Highest accuracy obtained by CRNN (90%) Ashraf et al. [5] – “A Hybrid CNN and RNN Variant Model for Music Classification” - CNN-LSTM - CNN-BiLSTM - CNN-GRU - CNN-BiGRU GTZAN - Mel- spectrogram - MFCC - Highest accuracy obtained by CNN- BiGRU using Mel- spectrogram (89.3%) - CNN-LSTM using MFCC (76.4%) Mendes et al. [6] – “Deep Learning Techniques for Music Genre Classication and Building a Music Recommendation System” - CNN-LSTM - CNN-BiLSTM FMA Mel- spectrogram CNN-BiLSTM obtained the highest accuracy (90%) 3 Methods The dataset that is used in this work is the GTZAN dataset. GTZAN dataset was first created by George Tzanetakis and Perry Cook in 2002 [9]. This dataset consists of 10 genres of music, and each genre has a total of 100 WAV audio. Each audio file lasts 30 seconds with a 22,050 Hz sample rate. The genres inside this dataset are blues, classic, country, hip-hop, jazz, metal, pop, reggae, and rock. There is a single corrupted audio file within the jazz category. `The corrupted audio within the jazz category is removed before it got into data processing. Hence, the dataset contains only 999 audio files. The dataset was initially available on a MARSYAS website created by Tzanetakis and Cook [10]. Currently, the dataset is available to download at https://www.kaggle.com/datasets/andradaolteanu/gtzan- dataset-music-genre-classification (Kaggle). MFCC is extracted from each audio file inside the dataset. Sound extraction carried out by MFCC is based Reduced Convolutional Recurrent Neural Network Using MFCC… Informatica 49 (2025) 145–154 147 on estimated frequencies that humans can hear. The signal used in MFCC is the Mel scale, which uses a linear filter at frequencies below 1000 Hz and a logarithmic distance above 1000 Hz [11]. The output of this process is a spectrum wave graph or spectrogram that uses this frequency scale. This spectrogram contains feature coefficients, and these coefficient values represent the audio signal [12]. MFCC can capture important voice characteristics in recognition and critical information in voice, produce minimal data without losing much information, and replicate human hearing sound [13]. MFCC extraction is handled using Librosa, a Python package that provides audio and music signal processing [14]. Several parameters are needed to extract MFCC, such as the number of coefficients needed, window length, and hop length. This work extracts 13 cepstral coefficients from every audio. Window length and hop length used are 2,048 and 512, respectively. Each audio file is split into ten segments to augment the training data. The results of this process are saved on a JSON file containing an array with the 13 MFCC coefficient inside. The proposed CRNN model consists of three layers: a convolutional layer, a max pooling layer, and batch normalization on the CNN layer, with two BiLSTM layers for the RNN. After going into the CNN and RNN layers, the inputs are flattened, and the fully connected layer connects all the neurons with the output layer. This work investigates two models: the models that used dropout before the BiLSTM layer and those that did not use dropout. The CNN layer consists of a convolutional layer, a max pooling layer, and batch normalization. The convolutional layer uses 32 filters with 3x3 kernel size on the first and second layers and 2x2 kernel size on the third layer, including ReLU for the activation function. ReLU is preferred in this work due to its ability to remove negative values. The Max pooling layer has a 3x3 pool size for the first and second layers, 2x2 for the third layer, and 2x2 stride. Pooling is done to progressively lower the model's computational complexity, parameter count, and control overfitting [15]. Pooling reduces the size of the matrix in the feature map. Max pooling is one of the most popular forms of pooling. Max pooling extracts the highest value inside patches from the feature map and discards the rest of the values [16]. Batch normalization is employed to mitigate sudden changes in each layer [6]. The input is reshaped before being passed to the RNN layer, as the CNN and RNN layers require different input shapes. In this work, a dropout layer is added before the RNN layer as part of the investigation. Dropout is widely recognized for its ability to prevent overfitting by randomly deactivating neurons during training [17]. A dropout rate of 0.3 (30%) is used before the RNN layer. This value is selected based on its superior validation accuracy compared to dropout rates of 0.2 and 0.4. In this work, two layers of BiLSTM are used, with return sequences set to true for the first layer. Return sequences are used to send forward all of the LSTM hidden layer sequences to the next layer [18]. The input went to the dense or fully connected layer. The input is flattened before going to the dense layer. The number of filters used for the dense layer is 64 units with the ReLU activation function, which is then followed by another dropout of 30%. Figure 1 shows the overview of the model proposed in this work. Figure 1: Architecture overview of the proposed CRNN model. The dataset is divided into training, validation, and test sets with a 60-20-20 ratio using a simple train- validation-test split. Initially, the dataset is split into training and test sets with an 80-20 ratio. The 80% training set is then further divided into training and validation sets with a 75-25 ratio. The validation set is used to evaluate the model during training, while the test set is reserved exclusively for testing the trained model and is not used during the training process. Several parameters, such as the optimizer, loss function, and evaluation metrics, are configured for the model. The optimizer employed in this work is Adam, tested with multiple learning rates: 0.1, 0.01, 0.001, and 0.0001. These learning rates were selected to observe the accuracy trend, specifically whether smaller learning rates lead to higher accuracy. The loss function is sparse categorical cross-entropy, and the primary performance metric is accuracy. The model is trained for 100 epochs with a batch size of 32. After training, the model is evaluated using the test set. The evaluation metrics include accuracy, loss, precision, recall, and the F1 score. 4 Results and analysis This work tests several parameters, namely dropout before the RNN layer and several learning rates values. Model testing uses the test set, which has not been used in any process. Table 2 shows the model's testing result using the test set. 148 Informatica 49 (2025) 145–154 E. Setiorini et al. Table 2: Test results from the model Dropout Learning rate Accuracy Loss Precision Recall F1 score - 0.1 10.06% 230.86% 1.01% 10.06% 1.84% - 0.01 74.07% 91.52% 74.82% 74.07% 73.81% - 0.001 86.74% 64.69% 86.84% 86.74% 86.67% - 0.0001 84.68% 60.25% 84.7% 84.68% 84.56% ✓ 0.1 10.91% 230.88% 1.19% 10.91% 2.15% ✓ 0.01 74.72% 75.5% 74.94% 74.72% 74.43% ✓ 0.001 88.64% 51.95% 88.85% 88.64% 88.62% ✓ 0.0001 86.24% 46.91% 86.28% 86.24% 86.19% Both models, with or without dropout that used a 0.1 learning rate had very low accuracy because the learning rate was too big. A significant learning rate would speed up the training process, but the model could need more time to analyze the data thoroughly. Based on the accuracy, models with a learning rate of 0.1 could not study the data. Models with a 0.01 learning rate had similar accuracy, around 74%. Models using a 0.001 learning rate had the highest result out of all learning rates, but when compared to the use of dropouts, the model using dropouts had higher accuracy. The model that used dropout and a 0.001 learning rate got the highest accuracy out of all other tested models, with an accuracy of 88.64%. Models with a 0.0001 learning rate achieved high results, but not higher than models with a 0.001 learning rate. Loss results from models that used dropout are significantly lower than those that did not. Meanwhile, each model's precision, recall, and F1 scores have similar results in terms of accuracy. The following figures are accuracy and loss graphs from the training process. The models suffer from overfitting, but models that used dropout suffered less than those that did not. Figure 2 shows training results from models without dropouts, and Figure 3 shows training results from models with dropouts. The blue line represents results from the train set, and the orange line represents results from the validation set. (a) (b) Reduced Convolutional Recurrent Neural Network Using MFCC… Informatica 49 (2025) 145–154 149 (c) (d) Figure 2: Accuracy and loss graph for models without dropout with (a) Learning rate 0.1, (b) Learning rate 0.01, (c) Learning rate 0.001, and (d) Learning rate 0.0001 (a) (b) (c) (d) Figure 3: Accuracy and loss graph for models with dropout with (a) Learning rate 0.1, (b) Learning rate 0.01, (c) Learning rate 0.001, and (d) Learning rate 0.0001 Based on Figures 2 and 3, graphs from learning rate 0.1 show underfitting due to the high loss and low accuracy achieved on all epochs. The results indicate that the model is unable to learn from the data. A learning rate of 0.01 shows overfitting only for the model without dropout. The other model had a good fit because the difference between the train and validation set is minimal. Learning rates of 0.001 and 0.0001 also show overfitting, but the model with dropout is better than without dropout. Thus, the use of dropout effectively reduces overfitting on models. Weight regularization such as L2 are not implemented in this model. Only dropout is used in this 150 Informatica 49 (2025) 145–154 E. Setiorini et al. model to maintain the simplicity of the model and also due to device limitation. Further evaluation on Figures 4 and 5 depict the confusion matrix of the results. Figure 4 shows the confusion matrix from models without dropouts, and Figure 5 shows the confusion matrix from models with dropouts. The confusion matrix for models with a learning rate of 0.1 showed that the model could not classify any data. Meanwhile, models with a learning rate of 0.01 made many mistakes when classifying data. The accuracy achieved using this learning rate is relatively low, around 70%. Learning rates 0.01 and 0.001 show overfitting but are still able to classify the data sufficiently. (a) (b) (c) (d) Figure 4: Confusion matrix of models without dropout with (a) Learning rate 0.1, (b) Learning rate 0.01, (c) Learning rate 0.001, and (d) Learning rate 0.0001 Reduced Convolutional Recurrent Neural Network Using MFCC… Informatica 49 (2025) 145–154 151 (a) (b) (c) (d) Figure 5: Confusion matrix of models with dropout with (a) Learning rate 0.1, (b) Learning rate 0.01, (c) Learning rate 0.001, and (d) Learning rate 0.0001 The ROC curves in Figures 6 and 7 provide detailed insights into the classification performance of the models under different configurations. Key metrics such as the true positive rate (TPR) and false positive rate (FPR) are visually represented, where a steeper curve rising toward the top-left corner indicates higher sensitivity (recall) and better classification of positive instances. Additionally, a lower FPR, reflected in curves closer to the y-axis, suggests improved discrimination between classes. The area under the curve (AUC) serves as a summary metric, quantifying the model's overall performance. Higher AUC values, closer to 1.0, denote stronger classification capabilities. By comparing models with and without dropout across varying learning rates, it becomes evident that the inclusion of dropout generally enhances the stability and sharpness of the curves, indicating better generalization and resistance to overfitting. Notably, models with lower learning rates (e.g., 0.001 and 0.0001) demonstrate more consistent and pronounced ROC curves, suggesting that these configurations strike a balance between convergence and classification accuracy. 152 Informatica 49 (2025) 145–154 E. Setiorini et al. (a) (b) (c) (d) Figure 6: ROC curves of models without dropout with (a) Learning rate 0.1, (b) Learning rate 0.01, (c) Learning rate 0.001, and (d) Learning rate 0.0001 (a) (b) Reduced Convolutional Recurrent Neural Network Using MFCC… Informatica 49 (2025) 145–154 153 (c) (d) Figure 7: ROC curves of models with dropout with (a) Learning rate 0.1, (b) Learning rate 0.01, (c) Learning rate 0.001, and (d) Learning rate 0.0001 A confusion matrix of a test set with 1,998 data points shows that the model with dropout and learning rates 0.01, 0.001, and 0.0001 can classify 1,493, 1,771, and 1,723 data, respectively. In contrast, the model without dropout can classify 1,480, 1,733, and 1,692 data out of 1,998. The model with a learning rate of 0.1 will not be discussed further because the model cannot classify the data. The model using dropout and learning rate 0.001 has the highest number of correct classification results compared to the other models, indicating that it is the best model. Further analysis of the results indicates that the classical genre is the most accurately classified genre. The rock genre is the most misclassified due to the proximity between rock, country, and disco. The dataset also poses discrepancies with other references regarding the genre of certain audio files. In addition, the lack of artist variation in the dataset also affects the model classification ability due to the lack of variation of music genres, as an artist tends to produce music within one music genre. Audio from the rock genre was frequently misclassified as country, disco, or blues. Additionally, the model often misclassified non-rock audio as rock. Upon closer examination of the dataset, this issue appears to stem from the presence of numerous repetitive patterns and mislabeled samples under the rock category. The findings of this study have significant implications for the field of music information retrieval (MIR), particularly in advancing genre classification systems. The results demonstrate that a simpler CRNN architecture, when paired with effective regularization techniques like dropout and optimized learning rates, can achieve competitive accuracy, outperforming more complex models while maintaining computational efficiency. The highest classification accuracy of 88.64% highlights the potential of lightweight models in resource- constrained environments. Moreover, the analysis underscores critical challenges in MIR, such as dataset quality and diversity. Issues like mislabeled samples, genre overlaps, and limited artist variation directly impact classification performance, as evidenced by the frequent misclassification of rock as similar genres like country, disco, and blues. These challenges emphasize the importance of curated datasets and robust preprocessing techniques in developing reliable MIR systems. Addressing these limitations, along with strategies like artist-level cross-validation and stratified sampling, could not only enhance genre classification accuracy but also extend the applicability of MIR systems to more nuanced tasks, such as personalized music recommendations and musicological analysis. 5 Conclusion This work proposes a simpler architecture of a CRNN algorithm to classify music. The aim is to see whether a greater accuracy is achieved compared to more complex models. The proposed model comprises three CNN layers and two RNN (specifically BiLSTM) layers. The feature extraction used in this model is MFCC, which splits the dataset into ten segments to increase the total data. A dropout level of 0.001 learning rate achieved the highest accuracy of 88.64%. The accuracy is higher than the previous work accuracy of 73.69% using the same CRNN model with more CNN and RNN layers. The primary limitation of this work is that the model tends to overfit, necessitating further efforts to mitigate this issue. Future work could focus on cleansing the dataset by removing redundant and incorrectly labeled songs, which may enable the model to learn more effectively from the data. Additionally, incorporating early stopping techniques could help prevent overfitting. Other potential strategies include applying weight regularization, implementing artist-level cross-validation, and utilizing stratified sampling to improve the model's robustness. Acknowledgements The authors express gratitude to Universitas Multimedia Nusantara for the support and acknowledge the valuable input of the reviewers and associate editor. 154 Informatica 49 (2025) 145–154 E. Setiorini et al. References [1] D. Stefani and L. Turchet, “On the Challenges of Embedded Real-time Music Information Retrieval,” in Proceedings of the International Conference on Digital Audio Effects, DAFx, 2022, pp. 177–184. [2] P. Ghosh, S. Mahapatra, S. Jana, and R. Kr. Jha, “A Study on Music Genre Classification using Machine Learning,” Int. J. Eng. Bus. Soc. Sci., vol. 1, no. 04, pp. 308–320, 2023, doi: 10.58451/ijebss.v1i04.55. [3] J. Sang, S. Park, and J. Lee, “Convolutional recurrent neural networks for urban sound classification using raw waveforms,” Eur. Signal Process. Conf., vol. 2018-Septe, pp. 2444–2448, 2018, doi: 10.23919/EUSIPCO.2018.8553247. [4] V. Bella and S. A. Sanjaya, “Refining Baby Cry Classification using Data Augmentation (Time- Stretching and Pitch-Shifting), MFCC Feature Extraction, and LSTM Modeling,” in 2023 7th International Conference on New Media Studies (CONMEDIA), IEEE, 2023. doi: https://doi.org/10.1109/CONMEDIA60526.2023. 10428158. [5] M. Ashraf et al., “A Hybrid CNN and RNN Variant Model for Music Classification,” Appl. Sci., vol. 13, no. 3, 2023, doi: 10.3390/app13031476. [6] J. Mendes, “Deep Learning Techniques for Music Genre Classification and Building a Music Recommendation System,” National College of Ireland, 2020. [7] X. Luo, “Automatic Music Genre Classification based on CNN and LSTM,” Highlights Sci. Eng. Technol., vol. 39, pp. 61–66, 2023, doi: 10.54097/hset.v39i.6494. [8] M. K. Kumar, K. Sujanasri, B. Neha, G. Akshara, P. Chugh, and P. Haindavi, “Automated Music Genre Classification through Deep Learning Techniques,” E3S Web Conf., vol. 430, 2023, doi: 10.1051/e3sconf/202343001033. [9] G. Tzanetakis and P. Cook, “Musical genre classification of audio signals,” IEEE Trans. Speech Audio Process., vol. 10, no. 5, pp. 293– 302, 2002, doi: 10.1109/TSA.2002.800560. [10] G. Tzanetakis and P. Cook, “MARSYAS: A framework for audio analysis”. [11] S. Y. Yehezkiel and Y. Suyanto, “Music Genre Identification Using SVM and MFCC Feature Extraction,” IJEIS (Indonesian J. Electron. Instrum. Syst., vol. 12, no. 2, p. 115, 2022, doi: 10.22146/ijeis.70898. [12] I. D. G. Y. A. Wibawa and I. D. M. B. A. Darmawan, “Implementation of audio recognition using mel frequency cepstrum coefficient and dynamic time warping in wirama praharsini,” J. Phys. Conf. Ser., vol. 1722, no. 1, 2021, doi: 10.1088/1742-6596/1722/1/012014. [13] H. Heriyanto, T. Wahyuningrum, and G. F. Fitriana, “Classification of Javanese Script Hanacara Voice Using Mel Frequency Cepstral Coefficient MFCC and Selection of Dominant Weight Features,” J. Infotel, vol. 13, no. 2, pp. 84– 93, 2021, doi: 10.20895/infotel.v13i2.657. [14] B. McFee et al., “librosa: Audio and Music Signal Analysis in Python,” Proc. 14th Python Sci. Conf., no. August 2020, pp. 18–24, 2015, doi: 10.25080/majora-7b98e3ed-003. [15] P. A. Aritonang, M. E. Johan, and I. Prasetiawan, “Aspect-Based Sentiment Analysis on Application Review using CNN (Case Study : Peduli Lindungi Application),” Ultim. Infosys J. Ilmu Sist. Inf., vol. 13, no. 1, pp. 54–61, 2022. [16] R. Yamashita, M. Nishio, R. Kinh, G. Do, and K. Togashi, “Convolutional neural networks : an overview and application in radiology,” pp. 611– 629, 2018. [17] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, “Dropout: A simple way to prevent neural networks from overfitting,” J. Mach. Learn. Res., vol. 15, pp. 1929–1958, 2014. [18] A. Katrompas and V. Metsis, “Enhancing LSTM Models with Self-attention and Stateful Training,” in Lecture Notes in Networks and Systems, 2022, pp. 217–235. doi: 10.1007/978-3-030-82193- 7_14.