Invariant Audio Prints for Music Indexing and Alignment

Rémi Mignot, Geoffroy Peeters

Welcome to the companion web site of our paper “Invariant Audio Prints for Music Indexing and Alignment”. This page provides illustrations of the experiments.

Cite the paper:

   Rémi Mignot, Geoffroy Peeters
   "Invariant Audio Prints for Music Indexing and Alignment"
   21st Int. Conf. on Content-based Multimedia Indexing (CBMI)
   Reykjavik, Iceland, September, 2024.


Experiment 1: Audio Indexing and Segmentation of Medleys

Here is an illustration of the results of the experiment "Audio Indexing and Segmentation of Medleys" of Sec. IV.A. A music medley has been created using excerpts of 6 pop songs, each lasting between 15 and 30 s. Depending on the configuration, each excerpt is transposed by 0, ±1, ±2 or ±4 half-tones, and time-stretched by 0, ±33 or ±66 cents. For example, 33 cents corresponds to a speed-up by a factor of 1.26, and 66 cents to a factor of 1.58. Each configuration is tested first without audio degradations, then with audio degradations: filtering, noise addition, 40 kbps MP3 coding, and distortion (see the paper).
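The two factors quoted above (1.26 and 1.58) are consistent with a stretching factor of 2^(c/100) for a value of c cents, that is, with a cent counted as one hundredth of an octave. This reading is inferred from the numbers and is not stated on this page. The frequency ratio for a shift of h half-tones, 2^(h/12), is the usual equal-temperament relation. A minimal sketch, with hypothetical function names:

```python
def stretch_factor(cents: float) -> float:
    """Time stretching factor for a value in cents.

    Assumption: 100 cents = one octave (factor 2), inferred from
    the examples 33 ct -> 1.26 and 66 ct -> 1.58.
    """
    return 2.0 ** (cents / 100.0)


def pitch_ratio(half_tones: float) -> float:
    """Frequency ratio for a transposition in half-tones (12 per octave)."""
    return 2.0 ** (half_tones / 12.0)


if __name__ == "__main__":
    for c in (0, 33, 66):
        print(f"{c:>3} ct -> time factor {stretch_factor(c):.2f}")
    for h in (0, 1, 2, 4):
        print(f"{h:>3} ht -> frequency ratio {pitch_ratio(h):.3f}")
```

Negative values give the inverse factors, so the ± configurations are symmetric on a logarithmic scale.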

The objective of this experiment is to find the original excerpts of the reference songs in a catalog of approximately 40,000 songs, together with their time positions in the reference songs and the stretching factor that was applied. Note that the pitch shifting factor is not estimated. This experiment aims to demonstrate the robustness of the developed audio prints to different transformations: pitch shifting, time stretching and other degradations (noise addition, distortion, filtering, ...). Finally, using the time alignment provided by the method, we can also estimate the time stretching factor in order to realign the reference excerpts to the modified medley.

Below, for each configuration, 2 music previews are available:

  1. the first preview contains the analysed medley, with or without modifications of the excerpts,
  2. the second preview contains the original medley on the left channel, and the medley rebuilt from the detection on the right channel.
    The estimated time positions and stretching factor are used for the reconstruction, but the true pitch factor is used here (because it is not estimated).
    The time precision can therefore be heard by comparing the synchronisation of the two channels.

Remark: when the mouse cursor is over a cell of the table, the figure on the right displays the result for the corresponding configuration. The colored rectangles represent the true songs on the bottom half and the detected songs on the top half; the white spaces of the upper part are time ranges where no song is detected. The black (resp. red) lines plot the time mapping between the medley time (x-axis) and the time of the true (resp. detected) reference songs (y-axis). The slope of the black lines indicates the time stretching of the segments.
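As an illustration of how a slope relates to a stretching factor, the sketch below fits a straight line to pairs of (medley time, reference time) by ordinary least squares. The data points and the function name are hypothetical, and this is not necessarily the estimator used in the paper. Whether the slope equals the stretching factor or its inverse depends on which axis carries the medley time; here the x-axis is the medley time, as in the figure.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


if __name__ == "__main__":
    # Hypothetical matches: medley time (s) -> time in the reference song (s).
    medley_times = [10.0, 11.0, 12.0, 13.0, 14.0]
    reference_times = [42.0 + 1.26 * (t - 10.0) for t in medley_times]

    slope, intercept = fit_line(medley_times, reference_times)
    print(f"slope (reference seconds per medley second): {slope:.3f}")
    print(f"reference time at medley time 0: {intercept:.2f} s")
```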

Remark: this is only an illustration for one medley. The results reported in the paper have been averaged over 730 different medleys. Recall that the audio prints of the catalog are computed on the original songs, without transformation.

Without degradations

[Table of audio previews. Columns, time stretching (cents): 0 (ct), ± 33 (ct), ± 66 (ct). Rows, pitch shifting (½ tones): 0 (ht), ± 1 (ht), ± 2 (ht), ± 4 (ht).]

With degradations

[Table of audio previews. Columns, time stretching (cents): 0 (ct), ± 33 (ct), ± 66 (ct). Rows, pitch shifting (½ tones): 0 (ht), ± 1 (ht), ± 2 (ht), ± 4 (ht).]


Experiment 2: Audio-to-Audio Alignment

This section illustrates the results of the experiment "Audio-to-Audio Alignment" of Sec. IV.B. An original MIDI file has been modified in order to continuously change the tempo of the song. By comparing the synthesized audio signals of the original song and of the modified song, this experiment aims to estimate the time mapping between the two signals.
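To make the link between a continuously varying tempo and a time mapping concrete, the sketch below integrates a sampled local stretching factor with the trapezoid rule. Everything in it is an assumption made for illustration: the sinusoidal tempo curve, the sampling step, the function name, and the convention that the factor is the number of original seconds elapsed per second of the modified song, with values in cents converted as 2^(c/100).

```python
import math


def time_mapping(factors, dt):
    """Cumulative trapezoid integration of a sampled local factor.

    factors[k] is the local factor at modified time k * dt; the result
    gives the corresponding position in the original song (s).
    """
    mapped = [0.0]
    for a, b in zip(factors, factors[1:]):
        mapped.append(mapped[-1] + 0.5 * (a + b) * dt)
    return mapped


if __name__ == "__main__":
    dt = 0.1                      # sampling step of the modified time (s)
    n = 101                       # 10 s of signal
    # Hypothetical tempo curve oscillating between -33 and +33 cents.
    cents = [33.0 * math.sin(2.0 * math.pi * k * dt / 10.0) for k in range(n)]
    factors = [2.0 ** (c / 100.0) for c in cents]

    mapping = time_mapping(factors, dt)
    for k in range(0, n, 25):
        print(f"modified t = {k * dt:4.1f} s -> original t = {mapping[k]:.2f} s")
```

With a constant factor of 1 the mapping reduces to the identity, which is a convenient sanity check.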

To also evaluate the robustness to other modifications, the pitch has been changed by shifting the MIDI note numbers. For the second set of tests, the instruments were also randomly changed and the drum track was removed.

The configurations are:

  1. same instruments,
  2. different instruments and removed drum track.

Below, for each configuration, 2 music previews are available:

  1. first, the modified song,
  2. the second preview contains the modified song on the left channel, and on the right channel, the original song is realigned to the modified song using the estimated alignment.
    The estimated time positions and stretching factor are used for the reconstruction, but the true pitch factor is used here (because it is not estimated).
    The time precision can therefore be heard by comparing the synchronisation of the two channels.
    The realignment is done in the audio domain with a phase vocoder method.

Remark: when the mouse cursor is over a cell of the table, the figure on the right displays the result for the corresponding configuration. The blue curve of the upper figure shows the time error of the estimated time mapping. The second figure compares the true time stretching factor (varying in time) with the estimated time stretching factor. Finally, the third figure displays the local distance matrix based on the Hamming distance of the Audio Prints.
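A local distance matrix of this kind can be sketched as follows. The representation is an assumption made for the example: each audio print is stored as an integer whose bits form the binary code, and the 8-bit values are invented. The actual print format and length are described in the paper, not here.

```python
def hamming(a: int, b: int) -> int:
    """Number of differing bits between two binary codes stored as integers."""
    return bin(a ^ b).count("1")


def local_distance_matrix(prints_a, prints_b):
    """Matrix D with D[i][j] = Hamming distance between prints_a[i] and prints_b[j]."""
    return [[hamming(a, b) for b in prints_b] for a in prints_a]


if __name__ == "__main__":
    # Hypothetical 8-bit prints, one per analysis frame.
    original = [0b10110010, 0b01101100, 0b11110000, 0b00011101]
    modified = [0b10110011, 0b01101100, 0b01101101, 0b11110000, 0b00011101]

    for row in local_distance_matrix(original, modified):
        print(" ".join(f"{d:2d}" for d in row))
```

Low values in the matrix mark pairs of frames with similar prints; an alignment path then follows a valley of low distances across the matrix.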

Remark: this is only an illustration for one MIDI song. The results reported in the paper have been averaged over 238 different MIDI songs.

Same instruments

[Audio preview of the original music, followed by a table of audio previews. Columns, time stretching (cents): ± 33 (ct), ± 66 (ct), ± 100 (ct). Rows, pitch shifting (½ tones): 0 (ht), ± 2 (ht), ± 6 (ht), ± 13 (ht).]

Different instruments and removed drum

[Audio preview of the original music, followed by a table of audio previews. Columns, time stretching (cents): ± 33 (ct), ± 66 (ct), ± 100 (ct). Rows, pitch shifting (½ tones): 0 (ht), ± 2 (ht), ± 6 (ht), ± 13 (ht).]

BONUS: Audio-to-Audio Alignment of a Guitar Cover

Here is an additional experiment (not presented in the paper) on a real-world example. This experiment aims to align in time the original recording of "Little Wing" by The Jimi Hendrix Experience with a cover played by Corey Heuvel on an acoustic guitar. After the analysis of the original recording and of the cover, the original recording is realigned in time to the cover, and added to the cover video.

For the first realigned video, the left channel contains the cover (unchanged) and the right channel contains the realigned original recording of Jimi Hendrix. For the second realigned video, only the realigned original recording is played.

Note that whereas the original song has a tempo of approximately 70 BPM, the average tempo of the cover is almost 60 BPM and varies slightly over time. Additionally, some transitions between the introduction, the verses and the solo are longer in the cover. For example, listen to the transition between the second verse and the solo at 1:42 of the cover: the time of the original recording is strongly dilated so that it stays synchronised before and after the transition.
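As a rough order of magnitude, the two approximate tempi give the average stretching needed to bring the original onto the cover. The sketch below uses the approximate values quoted above (70 and 60 BPM) and assumes, as in the earlier experiments, that 100 cents correspond to a factor of 2; the function names are hypothetical.

```python
import math


def tempo_ratio(original_bpm: float, cover_bpm: float) -> float:
    """Factor by which the original must be slowed down to match the cover."""
    return original_bpm / cover_bpm


def ratio_to_cents(ratio: float) -> float:
    """Convert a time factor to cents, assuming 100 cents = factor 2."""
    return 100.0 * math.log2(ratio)


if __name__ == "__main__":
    ratio = tempo_ratio(70.0, 60.0)
    print(f"average stretching ratio: {ratio:.3f}")
    print(f"equivalent value: {ratio_to_cents(ratio):.1f} cents")
    print(f"10 s of the original last about {10.0 * ratio:.1f} s in the cover")
```

This is only an average: the local factor departs from it wherever the cover changes tempo or lengthens a transition.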

Original: Jimi Hendrix - Little Wing
Cover: Corey Heuvel - Little Wing (acoustic)
Alignment 1: left channel, cover; right channel, aligned original
Alignment 2: only the aligned original