DNNs are mainly used in the MRS domain for two purposes.

  1. extracting latent factors of music items from audio signals or metadata; these latent item factors are integrated into content-based filtering and hybrid MRS.
  2. learning sequential patterns of music items (tracks or artists) from music playlists or listening sessions; these patterns are used for sequential music recommendation, i.e., automatic playlist continuation.


  1. particularities of the music domain in RS research.
  2. gives an insight into the current SOTA in DL for MR.


1.1 Music Information Retrieval

MIR has its origins in signal processing, which leads to CB approaches.

CB approaches: features are extracted from the actual audio signal.

MIR related research

  • musical score following
  • intelligent music browsing interfaces
  • automatic music categorization (into genres or affective categories ex. mood)

Much research in MIR has addressed audio similarity (a prerequisite for building CB systems), but little research has been done on music recommendation.

1.2 Recommender Systems


Why music is different

Music differs from the items handled by recommender systems in other domains, such as products, movies, or hotels.

  1. short consumption time:
    the duration of a music track is much shorter than that of a movie, a holiday, or the use of a product.

  2. abundance of songs available:
    commercial music catalogs contain a very large number of tracks.

1, 2 -> Music recommendations do not need to match the user's preferences perfectly, since an occasional imperfect recommendation does not harm the user experience much.

3 -> CB features extracted from the audio signal play a much larger role in music than in other domains.

  4. music can evoke very strong emotions, especially when recommendations match the listener's preferences well.
    State-of-the-art music emotion recognition techniques often make use of DL [11, 33].

  5. music is often consumed in sequence, typically as playlists of music tracks or listening sessions.
    Therefore, recommending not only an unordered set of songs, but a meaningful sequence of songs, is an important task in the music domain.

    Some DL techniques have been developed to leverage sequential information;
    for instance, Recurrent Neural Networks have greatly boosted approaches for automated playlist generation and next-track recommendation.
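To make the next-track task concrete, here is a minimal sketch that deliberately does not use an RNN. It swaps in a much simpler first-order transition-count model (which track most often directly follows the current one), only to illustrate what "learning sequential patterns from playlists" means. The function names and the toy playlists are assumptions made for illustration.

```python
from collections import Counter, defaultdict

def fit_transitions(playlists):
    """Count how often track b directly follows track a across all playlists."""
    transitions = defaultdict(Counter)
    for playlist in playlists:
        for current, following in zip(playlist, playlist[1:]):
            transitions[current][following] += 1
    return transitions

def recommend_next(transitions, last_track, k=2):
    """Return up to k tracks that most often followed last_track."""
    return [track for track, _ in transitions[last_track].most_common(k)]

# Hypothetical toy playlists; the track IDs are made up.
playlists = [
    ["a", "b", "c"],
    ["a", "b", "d"],
    ["e", "b", "c"],
]
model = fit_transitions(playlists)
print(recommend_next(model, "b"))  # ['c', 'd']
```

An RNN-based recommender replaces the count table with a learned hidden state that can take the whole listening history into account, not only the last track.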

Content-Based and Hybrid Approaches

Recommender systems research in the music domain that leverages DL typically uses deep neural networks (DNN) to derive song or artist representations (embeddings or latent factors) from the audio content or textual metadata such as artist biographies or user-generated tags.

These latent item factors are then either

  1. used directly in CBF systems, such as nearest neighbor recommendation,
  2. integrated into matrix factorization approaches, or
  3. leveraged to build hybrid systems, most commonly integrating CBF and CF techniques.
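The first option, nearest neighbor recommendation on latent item factors, can be sketched as follows. This is only an illustration under stated assumptions: the factor vectors are invented (a real system would take them from the trained DNN), cosine similarity is one common choice of similarity measure, and the function names are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def nearest_neighbors(item_factors, seed, k=2):
    """Return the k items whose latent factors are most similar to the seed item."""
    seed_vec = item_factors[seed]
    scored = [
        (cosine(seed_vec, vec), item)
        for item, vec in item_factors.items()
        if item != seed
    ]
    scored.sort(reverse=True)  # highest similarity first
    return [item for _, item in scored[:k]]

# Hypothetical 3-dimensional latent factors (assumed values, not from any paper).
item_factors = {
    "track_a": [0.9, 0.1, 0.0],
    "track_b": [0.8, 0.2, 0.1],
    "track_c": [0.0, 0.1, 0.9],
    "track_d": [0.1, 0.9, 0.1],
}
print(nearest_neighbors(item_factors, "track_a"))  # ['track_b', 'track_d']
```

Because the factors are derived from content rather than from interactions, this kind of lookup also works for items that no user has listened to yet.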

Earliest work that uses DL for CB MRS (sorted by date)

  1. name: van den Oord et al.’s method:
    • model: a CNN with ReLU activation and no dropout, used to represent each song by 50 latent factors learned from audio features.
    • input data: short music audio snippets retrieved from 7digital for tracks in the Million Song Dataset (MSD).
    • detail: the CNN is trained on log-compressed Mel spectrograms (128 frequency bands, window size of 23 ms, 50% window overlap),
      computed from randomly sampled 3-second clips of the audio snippets.
    • variants: two algorithmic variants are investigated, minimizing the mean squared error (MSE) and minimizing the weighted prediction error (WPE) as objective function.
    • experiments: conducted on 382 K songs and 1M users of the MSD.
    • play counts: play counts for user–song pairs are converted to binary implicit feedback data (i.e., 1 if user u listened to item i regardless of the listening frequency; 0 otherwise).
  2. name: Wang and Wang’s method:
    • model: a deep belief network (DBN), trained with mini-batch stochastic gradient descent and standard maximum likelihood estimation (MLE).
    • input data: randomly sampled 5-second clips of audio snippets from 7digital.
    • detail: spectrograms (120 frequency bands) are computed on windows of 30 ms (no overlap), resulting in a 166×120 matrix representation of each 5-second clip.
      Principal component analysis (PCA) is then applied to reduce the dimensionality to 100,
      and this reduced signal representation is fed into the DBN.
    • The item representations learned by the DBN are then integrated into a graphical linear model together with implicit user preferences.
    • evaluation: performed on the listening data of the top 100 K users in the MSD [35] who listened to 283 K unique songs.
    • play counts: play counts are converted into implicit feedback. The authors investigate a warm-start scenario (all users and all items in the test set also appear in the training set) and a cold-start scenario (all users but not all items appear in the training set).
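Two of the preprocessing details in the list above can be checked with a small sketch: the conversion of play counts into binary implicit feedback, and the frame count behind the 166×120 spectrogram (5-second clip, 30 ms windows, no overlap). The function names and sample data are assumptions made for illustration; neither paper's actual code is reproduced here.

```python
def binarize_play_counts(play_counts):
    """Keep a 1 for every (user, song) pair with at least one play.

    Pairs that are absent from the result are treated as 0.
    """
    return {pair: 1 for pair, count in play_counts.items() if count > 0}

def num_frames(clip_ms, window_ms, hop_ms):
    """Number of full analysis windows that fit into a clip."""
    if clip_ms < window_ms:
        return 0
    return (clip_ms - window_ms) // hop_ms + 1

# Hypothetical play counts keyed by (user, song).
play_counts = {("u1", "s1"): 12, ("u1", "s2"): 1, ("u2", "s1"): 0}
print(binarize_play_counts(play_counts))
# {('u1', 's1'): 1, ('u1', 's2'): 1}

# 5-second clip, 30 ms windows, no overlap (hop equals window length).
print(num_frames(clip_ms=5000, window_ms=30, hop_ms=30))  # 166
```

With 120 frequency bands per frame, the 166 frames give the 166×120 matrix mentioned above.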

Using the root mean squared error (RMSE) as performance metric, the proposed DBN-based approach achieves 0.323 in warm-start and 0.478 in cold-start. Given the binary rating representation, a baseline that randomly predicts 0 or 1 would achieve an RMSE of 0.707; a mean predictor that always predicts a rating of 0.5 would achieve an RMSE of 0.500. Results of the DBN are almost equal to those achieved by the best-performing approach by van den Oord et al. [23], i.e., RMSE of 0.325 in warm-start and 0.495 in cold-start.

Wang and Wang also propose a hybrid MRS that integrates the DBN output and a probabilistic matrix factorization model (PMF) [36] for collaborative filtering. This hybrid achieves an RMSE of 0.255 (warm-start).

Liang et al. propose a hybrid MRS that integrates content features learned via a multi-layer perceptron (MLP) as prior into probabilistic matrix factorization [25]. The authors train an MLP (3 fully-connected layers, ReLU activation, dropout, mini-batch stochastic gradient descent) using as input data vector-quantized MFCCs of 370 K tracks of the MSD. The MLP is trained with the objective to predict 561 user-generated tags, i.e., for an autotagging or tag prediction task. The authors then use the output of the last hidden layer (1,200 units) as latent content representation of songs and assume that this representation captures music semantics. This latent content model is integrated as prior into a PMF model, which is trained with MLE. Evaluation is performed on subsets of the MSD for warm-start and cold-start (new items) situations on 614 K users and 97 K songs. In the warm-start scenario, Liang et al.'s hybrid approach using MLP and PMF performs equal to an approach that directly uses the vector-quantized MFCCs instead of training an MLP and also equal to a standard weighted matrix factorization (WMF) approach [37]; all achieve a normalized discounted cumulative gain (NDCG) of 0.288. The cold-start scenario illustrates that using the latent features given by the MLP clearly outperforms the sole use of MFCC features (NDCG of 0.161 vs. 0.143).

Oramas et al. propose an approach to create separate representations of music artists and of music tracks, and integrate both into a CBF system [26]. First, they use WMF on implicit feedback data (derived from play counts) to obtain latent factors for artists and for songs. Subsequently, DNNs are trained to learn the latent artist and the latent song factors independently, using as input artist and track embeddings created from artist biographies and song content, respectively. To create song embeddings, spectrograms are computed using the constant-Q transform (96 frequency bands, window size of 46 ms, no overlap). For each track, only one 15-second snippet is considered. A CNN with ReLU activation and 50% dropout trained on the fixed-length CQT patches is then used to compute track embeddings. Artist embeddings are learned from biographies enriched with information from the DBpedia knowledge graph and represented as term frequency-inverse document frequency (TF-IDF) feature vectors [38]. An MLP using as input these TF-IDF vectors is then trained to obtain latent artist factors. The artist and track features are finally combined in a late fusion fashion, using again an MLP. Evaluation is carried out on a subset of the MSD (329 K tracks by 24 K artists for which biographies and audio are available). Oramas et al. report mean average precision (MAP) values at 500 recommendations of up to 0.020 for their approach when evaluated in an artist recommendation task and up to 0.004 when recommending tracks.
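The two RMSE baselines quoted earlier for binary feedback (0.707 for a random 0/1 predictor, 0.500 for a constant prediction of 0.5) can be reproduced with a short sketch. The target vector is hypothetical and the function names are assumptions. For the random predictor, the code takes the expected squared error analytically instead of sampling, so the result does not depend on a random seed.

```python
import math

def root_expected_mse_random(targets):
    """Root of the expected squared error when predicting 0 or 1 with equal probability."""
    expected_sq = [0.5 * (0 - t) ** 2 + 0.5 * (1 - t) ** 2 for t in targets]
    return math.sqrt(sum(expected_sq) / len(expected_sq))

def rmse_constant(targets, value=0.5):
    """RMSE of a predictor that always outputs the same value."""
    squared = [(value - t) ** 2 for t in targets]
    return math.sqrt(sum(squared) / len(squared))

# Hypothetical binary implicit-feedback targets.
targets = [1, 0, 0, 1, 0]
print(round(root_expected_mse_random(targets), 3))  # 0.707
print(round(rmse_constant(targets), 3))             # 0.5
```

Both values are independent of how many targets are 1: the random predictor is wrong half the time (expected squared error 0.5), and the constant predictor is always off by exactly 0.5.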

Future Study

