Composer Classification Based on Deep Learning

胡振; 傅昆; 张长水

doi:10.7544/issn1000-1239.2014.20140189

Audio Classical Composer Identification by Deep Neural Network

Abstract

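Abstract (translated from the Chinese): In music information retrieval, composer classification is an important problem whose goal is to identify the composer from audio data. Traditional classification algorithms rely on extracting complex features, whereas deep neural networks have comparatively strong feature-learning ability, so we propose using a deep neural network to solve this problem. To combine the strengths of different deep neural network models, we design a hybrid model based on the deep belief network (DBN) and the stacked denoising autoencoder (SDA), which handles composer classification well. Experiments show that the model achieves an accuracy of 76.26%, which is better than deep neural networks built purely from a single model type and better than support vector machines. As with image data, the human brain extracts music features hierarchically, and each layer processes the signal differently, so the hybrid model has certain advantages for composer classification.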

Abstract: Music is a signal with hierarchical structure. In music information retrieval (MIR), higher-level features such as emotion and genre are typically extracted from lower-level features such as pitch and spectral energy. Deep neural networks are well suited to hierarchical feature learning, which suggests that deep learning has the potential to perform well on music datasets. Audio classical composer identification (ACC) is an important MIR problem that aims to identify the composer of audio clips of classical music. In this work, a hybrid model based on the deep belief network (DBN) and the stacked denoising autoencoder (SDA) is built to identify the composer from the audio signal. The model achieves an accuracy of 76.26% on the test set, which is better than homogeneous deep models built from a single architecture and better than shallow models. After dimensionality reduction with linear discriminant analysis (LDA), it is also clear that samples from different classes move farther apart as they are transformed by more layers of the model. By comparing models of different sizes, we give some empirical guidance for the ACC problem. As with images, music features are hierarchical, and different parts of the brain handle signals differently. We therefore propose a hybrid model, and our results encourage us to believe that it makes sense in some applications. During the experiments, we also found some practical guidelines for choosing network parameters. For example, the number of neurons in the first hidden layer should be approximately three times the dimension of the input data.
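The sizing guideline and the denoising autoencoder mentioned in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the abstract does not specify the noise type, activation, weight tying, or input dimension, so the masking noise, sigmoid units, tied weights, the input dimension of 20, and all function names below are assumptions chosen for illustration.

```python
import numpy as np

def first_hidden_size(input_dim, ratio=3):
    """Rule of thumb from the abstract: about `ratio` times the input dimension."""
    return ratio * input_dim

def corrupt(x, drop_prob, rng):
    """Masking noise (an assumption): zero each entry independently with probability drop_prob."""
    mask = rng.random(x.shape) >= drop_prob
    return x * mask

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def dae_reconstruct(x, W, b_hidden, b_visible, drop_prob, rng):
    """Encode a corrupted input, then decode it with tied weights (W and W.T)."""
    hidden = sigmoid(corrupt(x, drop_prob, rng) @ W + b_hidden)
    recon = sigmoid(hidden @ W.T + b_visible)
    return hidden, recon

rng = np.random.default_rng(0)
input_dim = 20                               # hypothetical feature dimension
hidden_dim = first_hidden_size(input_dim)    # three times the input dimension
W = rng.normal(0.0, 0.01, size=(input_dim, hidden_dim))
x = rng.random((4, input_dim))               # four hypothetical input frames
hidden, recon = dae_reconstruct(
    x, W, np.zeros(hidden_dim), np.zeros(input_dim), drop_prob=0.3, rng=rng
)
```

Training such a layer would minimize the reconstruction error between `recon` and the uncorrupted `x`; the sketch shows only the forward pass and the layer sizing.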
