MELODY EXTRACTION ON VOCAL SEGMENTS USING MULTI-COLUMN DEEP NEURAL NETWORKS
Sangeun Kum, Changheun Oh, Juhan Nam
17th International Society for Music Information Retrieval Conference, New York, USA, 2016
Melody extraction is the task of automatically obtaining the f0 curve of the predominant melodic line from a mixture of multiple sources. Various algorithms have been proposed so far, and they can be broadly classified into three categories:
- Salience-based approaches use a salience function to estimate the salience of each possible pitch value.
- Source-separation-based approaches isolate the melody source from the mixture. These two categories account for the majority of melody extraction algorithms.
- Data-driven approaches, on the other hand, have rarely been attempted.
This gap motivated our work. To our knowledge, no one had attempted to use a deep neural network to extract melody. Deep learning is an active research topic these days, and it has shown strong performance when sufficient labeled data and computing power are available. We therefore used deep learning to estimate melody contours.
We used a multi-column deep neural network (MCDNN).
- The MCDNN was originally devised as an ensemble method to improve the performance of DNNs for image classification. Each deep neural column becomes an expert on inputs preprocessed in a different way, so averaging their predictions reduces the error.
- It was also applied to image denoising. In that approach, each column was trained on a different type of noise, and the column outputs were weighted to handle various noise types.
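The averaging used in the original image-classification MCDNN can be illustrated with a small Python sketch. The function name `average_columns` and the probability values are hypothetical and chosen only for illustration; this is not code from the cited work.

```python
def average_columns(column_probs):
    """Average per-class probabilities across columns.

    column_probs: list of equal-length lists, one list per column.
    """
    n_cols = len(column_probs)
    n_classes = len(column_probs[0])
    return [sum(col[k] for col in column_probs) / n_cols
            for k in range(n_classes)]


# Hypothetical outputs of three columns for a 3-class problem.
cols = [
    [0.6, 0.3, 0.1],
    [0.2, 0.5, 0.3],  # this column disagrees with the other two
    [0.5, 0.4, 0.1],
]

avg = average_columns(cols)
# Index of the class with the highest averaged probability.
best = max(range(len(avg)), key=avg.__getitem__)
```

Here the second column alone would pick class 1, but the averaged prediction follows the two columns that agree on class 0.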
This is the architecture of our proposed method.
- By using the MCDNN, our model predicts pitch at a finer resolution more accurately. Each DNN column takes multiple spectrogram frames as input to capture contextual information from neighboring frames.
- Each DNN column predicts pitch labels at a different resolution. The lowest resolution is 1 semitone, and each subsequent column doubles the resolution. Pitch predictions at lower resolutions are expanded by replicating each element so that the output sizes are the same for all columns.
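The replication step can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function name `expand_posterior` and the toy values are hypothetical, and it assumes the highest-resolution output size is an exact multiple of each lower-resolution output size.

```python
def expand_posterior(posterior, factor):
    """Repeat each element `factor` times, preserving order."""
    return [p for p in posterior for _ in range(factor)]


# Toy posterior from the 1-semitone column over two pitch bins
# (hypothetical values).
semitone_posterior = [0.8, 0.2]

# Match a column with 4 bins per semitone: each value is repeated 4 times.
expanded = expand_posterior(semitone_posterior, 4)
```

After expansion the vector no longer sums to 1, but this does not change which bin has the highest value.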
Given the outputs of the columns, we compute the combined posterior likelihood. Mathematically, we multiply the column probabilities together, which corresponds to summing the log-likelihoods of the predictions.
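The combination can be sketched as follows. This is a hedged illustration under stated assumptions, not the authors' implementation: the function name `combine_columns`, the small constant `eps` (added to avoid taking the log of zero), and the toy posteriors are all hypothetical. Lower-resolution outputs are replicated to the highest resolution before the log-probabilities are summed.

```python
import math


def combine_columns(columns, resolutions, eps=1e-12):
    """Sum log-probabilities of all columns at the highest resolution.

    columns: list of per-column posteriors (lists of probabilities).
    resolutions: bins per semitone for each column, e.g. [1, 2, 4].
    """
    top = max(resolutions)
    n_bins = len(columns[resolutions.index(top)])
    log_post = [0.0] * n_bins
    for probs, res in zip(columns, resolutions):
        factor = top // res
        # Replicate each element so this column matches the top resolution.
        expanded = [p for p in probs for _ in range(factor)]
        if len(expanded) != n_bins:
            raise ValueError("column sizes are inconsistent with resolutions")
        for i, p in enumerate(expanded):
            log_post[i] += math.log(p + eps)
    return log_post


# Toy posteriors over a 2-semitone range (hypothetical values).
col_1 = [0.7, 0.3]
col_2 = [0.1, 0.6, 0.2, 0.1]
col_4 = [0.05, 0.05, 0.1, 0.5, 0.1, 0.1, 0.05, 0.05]

log_post = combine_columns([col_1, col_2, col_4], [1, 2, 4])
# Pitch bin with the highest combined log-likelihood.
best_bin = max(range(len(log_post)), key=log_post.__getitem__)
```

Summing logs rather than multiplying probabilities gives the same ranking of pitch bins while avoiding numerical underflow when many small probabilities are combined.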
We verify this by illustrating three examples from different models. We selected an opera song from the ADC2004 dataset because it has dynamic pitch motion, such as high pitches and strong vibrato.
- The left one is from the single-column DNN (SCDNN) with a pitch resolution of 4, trained only on the RWC dataset.
- The middle one is from the same SCDNN trained with additional data. Compared with the first model, the additional songs help track the vibrato, but the second model still misses the whole excursion.
- The right one is from the 1-2-4 MCDNN. With the additional resolutions, the MCDNN improves further, tracking the pitch contours quite precisely.