Predicting a Song’s Success using Machine Learning

Patricio Contreras
8 min read · Feb 20, 2021

For my end-of-phase project in the Flatiron School Data Science bootcamp, I was tasked with selecting a dataset of my own and building a machine learning classifier. Since this was my first project where I got to choose the data myself, I wanted a dataset that not only fit the classification problem at hand, but was also something I’d genuinely enjoy working with. After scouring the web, I finally found the one: The Spotify Hit Predictor Dataset (1960–2019).

The Data

This dataset has information on a vast selection of songs from the 1960s all the way to 2019. Not only were each track’s name and artist recorded, but features such as danceability, energy, acousticness, and loudness were also present; most of these range from 0 to 1 (e.g., a value closer to 1 means the song is more danceable). More importantly, the reason this dataset served my purposes well was its binary target column, which indicated whether the song was a hit or a flop (1 or 0). Here’s a preview of what the data looked like:

First 5 rows and 12 columns

This also proved to be a good dataset in terms of size as I’d be working with over 40,000 rows and 19 columns. For more information on how a track was labelled as a hit (1) or flop (0), please click here.

The Problem

Composing a song is by no means an easy feat. Nowadays, music of all genres is available to a larger audience than ever, and tastes in music are more diversified than ever. This means that artists must take into account several factors that could determine whether a song is well received by the public. Should my song be more energetic? Should it be long or short? Should I focus on making it loud? Or do people prefer softer, acoustic songs?

In an effort to assist music artists and minimise the risk of writing a “flop”, this project aims to predict a potential song’s success from song features like danceability, acousticness, loudness, and more.

The Methods Used

This project follows the Cross-Industry Standard Process for Data Mining (CRISP-DM). We begin by investigating the data using techniques such as exploratory data analysis with visualisations. Once we gain a solid understanding of the data we’re working with, data cleaning is performed to prepare our data for model development.

Having cleaned the data as much as possible, we create 4 candidate baseline machine learning classifiers using all of our available features/variables. Once we evaluate their predictive performance, we try to optimise each classifier through methods such as hyperparameter tuning, grid search, and recursive feature elimination. In the end, the model with the best predictive performance is chosen.

Understanding the Data

The outputs shown below are a brief sample of the full exploratory data analysis (EDA) I ran in my Jupyter Notebook. To see the full EDA, please click here.

These histograms allow us to see how the data is distributed and provide insight into the nature (continuous vs. categorical) of each variable. Variables such as loudness and liveness have skewed distributions, and most songs have a danceability value of around 0.5. Furthermore, an important takeaway is that the response variable, target, is almost evenly split between hits and flops. Finally, most songs are written in a major key (mode), and the distribution of energy tells us that most songs have a high energy value.
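For reference, here is a minimal sketch of how histograms like these can be produced with pandas and matplotlib, assuming the per-decade CSVs from the Kaggle dataset have already been combined into a single DataFrame (the file name below is a placeholder, not the dataset’s actual file name):

```python
import pandas as pd
import matplotlib.pyplot as plt

# Placeholder file name; the Kaggle dataset ships as one CSV per decade,
# assumed here to have been concatenated into a single file.
df = pd.read_csv("spotify_hits_1960_2019.csv")

# Histogram of every numeric column (including the binary target)
# to inspect distributions, skewness, and class balance.
df.hist(figsize=(16, 12), bins=30)
plt.tight_layout()
plt.show()
```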

Unsurprisingly, The Beatles are the most successful band/artist in terms of number of hits in this dataset, followed closely by Sir Elton John!

  • The first plot on the left suggests “hit” songs tend to have high danceability and energy values; the high concentration of red circles (hit songs) in the upper-right corner supports this (a rough sketch of this plot follows the list).
  • Is there a relationship between the valence of a song and its success? In other words, do happier songs succeed more than sad songs? The second plot seems to suggest this is the case by having the “hit” distribution slightly higher up than the flop distribution.
  • There must be a relationship between the number of sections a song has and the duration, right? The third and final plot confirms this by showing a strong, positive relationship between the two variables. However, there doesn’t seem to be a relationship with target given that most points are concentrated in the bottom-left corner.
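As an illustration, here is a rough sketch of how the first scatter plot could be recreated, reusing the df from the EDA step above and colouring points by the target column:

```python
import matplotlib.pyplot as plt

# Red for hits (target == 1), blue for flops (target == 0).
colours = df["target"].map({1: "red", 0: "blue"})

plt.scatter(df["danceability"], df["energy"], c=colours, s=10, alpha=0.3)
plt.xlabel("danceability")
plt.ylabel("energy")
plt.title("Danceability vs. energy (hits in red)")
plt.show()
```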

Data Preparation

In this step of the project, I removed track, artist, and uri, since these variables are identifiers unique to each track rather than features shared across all songs. Whereas danceability, valence, and liveness describe qualities common to any song, track, artist, and uri only identify that particular recording.

Furthermore, I also plotted a correlation heatmap of the predictor variables to inspect any multicollinearity problems. While most features had a relatively low correlation with one another, duration_ms and sections had a dangerously high correlation over 0.8. I removed sections to avoid any problems later on.
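A condensed sketch of this preparation step, assuming the df from the EDA section; seaborn is used here for the heatmap, which is an assumption about the original notebook rather than a confirmed detail:

```python
import seaborn as sns
import matplotlib.pyplot as plt

# Drop identifier columns that are unique to each track rather than
# musical features shared across songs.
df = df.drop(columns=["track", "artist", "uri"])

# Correlation heatmap of the predictors to check for multicollinearity.
corr = df.drop(columns=["target"]).corr()
sns.heatmap(corr, cmap="coolwarm", center=0)
plt.show()

# duration_ms and sections are correlated above 0.8, so drop one of them.
df = df.drop(columns=["sections"])
```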

Modelling

The machine learning algorithms chosen for this project are: Logistic Regression, K-Nearest Neighbours (KNN), Random Forest, and Gradient Boosting. One way to assess how well the model does on unseen data is by performing a 70–30 train/test split of our data. Furthermore, the evaluation metrics recorded after we train each model will be: accuracy, F1 score, and AUC. ROC curves will also be plotted showing the relationship between the false positive rate and the true positive rate.

Using Python’s sklearn module, we created 4 baseline classifiers, meaning all available features were used and no hyperparameter tuning was performed. The results obtained for the baseline models are the following:
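As a rough illustration of that setup, here is a condensed sketch assuming the prepared DataFrame df from the previous section; the random seeds and default settings below are illustrative, not necessarily those used in the original notebook:

```python
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score

X = df.drop(columns=["target"])
y = df["target"]

# 70-30 train/test split to gauge performance on unseen data.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Four baseline classifiers with default settings (no tuning yet).
baselines = {
    "Logistic Regression": LogisticRegression(max_iter=1000),  # raised max_iter since features are unscaled
    "KNN": KNeighborsClassifier(),
    "Random Forest": RandomForestClassifier(random_state=42),
    "GBoost": GradientBoostingClassifier(random_state=42),
}

for name, clf in baselines.items():
    clf.fit(X_train, y_train)
    preds = clf.predict(X_test)
    probs = clf.predict_proba(X_test)[:, 1]
    print(f"{name}: accuracy={accuracy_score(y_test, preds):.3f}, "
          f"F1={f1_score(y_test, preds):.3f}, "
          f"AUC={roc_auc_score(y_test, probs):.3f}")
```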

After reviewing the plots above, the important takeaways are:

  • The classifier with the largest difference between training and test performance is Random Forest. Given its impeccable performance on the training set, it seems this classifier *overfitted*
  • The best performing classifier in terms of AUC is Random Forest with a value close to 0.8
  • As far as test performance goes, Random Forest and GBoost are the best
  • Logistic Regression is the worst-performing classifier out of the 4

Now that we’ve reviewed our evaluation metrics, let’s take a look at the ROC curves for each classifier:
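Below is a short sketch of how those ROC curves can be drawn on a single set of axes, reusing the fitted baselines dictionary from the sketch above and assuming scikit-learn 1.0+ for RocCurveDisplay:

```python
import matplotlib.pyplot as plt
from sklearn.metrics import RocCurveDisplay

fig, ax = plt.subplots(figsize=(8, 6))
for name, clf in baselines.items():
    # One ROC curve per fitted classifier, all on the same axes.
    RocCurveDisplay.from_estimator(clf, X_test, y_test, name=name, ax=ax)
ax.plot([0, 1], [0, 1], linestyle="--", color="grey")  # chance line
ax.set_title("ROC curves for the baseline classifiers")
plt.show()
```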

Model Optimisation/Tuning

This section of the project was handled the following way:

  • Logistic Regression: use a feature selection technique called Recursive Feature Elimination (RFE) to eliminate any redundant or disposable features. After that’s run, simulate several logistic regression models with different values for C, “the inverse of regularisation strength”, and see how it affects model performance. For more information on this parameter, please visit sklearn's LogisticRegression documentation.
  • KNN: after train/test splitting, scale the predictor variables since KNN is a distance-based algorithm. Similar to Logistic Regression, focus on a particular parameter (in this case n_neighbors) and simulate several KNN models to see how it affects performance.
  • Random Forest: run RFE to select the most important features from the dataset first. After that, find the optimal values for a select number of parameters using GridSearchCV. Since Random Forest is an ensemble method, it has several parameters that could potentially influence its performance. To avoid a computationally intensive exhaustive search, we focus on criterion, max_depth, and n_estimators (a sketch of this route is shown after this list).
  • Gradient Boost: run RFE to have the optimal number of features for this algorithm. Similar to Random Forest, perform a cross-validated grid search for the optimal values of the parameters learning_rate and max_depth.
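As an example of what one of these steps might look like in code, here is a sketch of the Random Forest route (RFE followed by a cross-validated grid search), assuming the same X_train/y_train split as before; the number of features kept and the parameter grid are illustrative, not the exact values used in the notebook:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE
from sklearn.model_selection import GridSearchCV

# Recursive Feature Elimination: keep the 10 most important features
# (the number kept here is illustrative).
rfe = RFE(RandomForestClassifier(random_state=42), n_features_to_select=10)
rfe.fit(X_train, y_train)
selected = X_train.columns[rfe.support_]

# Cross-validated grid search over the three parameters mentioned above.
param_grid = {
    "criterion": ["gini", "entropy"],
    "max_depth": [5, 10, 20, None],
    "n_estimators": [100, 200, 500],
}
grid = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    scoring="roc_auc",
    cv=5,
)
grid.fit(X_train[selected], y_train)
print(grid.best_params_, grid.best_score_)
```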

To check out the full process of how I tweaked these ML algorithms, please click here.

Final Results and Model Selection

After choosing the best features and parameter values for each algorithm, the final results can be seen below:

The three plots above show the difference in performance between each baseline model and its final, tuned counterpart. Here are the most important points:

  • Logistic Regression’s performance improved quite a lot in all areas
  • KNN’s training accuracy and F1 score dropped, but this was compensated by increased test scores. The difference between training and test scores decreased
  • Random Forest is the algorithm that changed the least across all areas. Unfortunately, the model still overfits the training data
  • Gradient Boost’s training scores improved to over 80%, while test scores increased only minimally. Test AUC also increased.

Judging by all of the performance graphs shown above, Random Forest would seem to be the best-performing algorithm for this dataset and problem. It excels in training scores (both accuracy and F1) and has the steepest ROC curve of all the classifiers. Furthermore, being an ensemble method makes it relatively robust to variance and to the choice of features.

However, the chosen model for this project will have to be Gradient Boosting, and precisely because of Random Forest’s stellar training performance. The huge difference between training and test scores (for both accuracy and F1) suggests that Random Forest overfitted the training data and won’t perform as well on unseen data. In contrast, Gradient Boost has a smaller difference between training and test scores and has an AUC very close to Random Forest’s. Even though Gradient Boosting did not do exceptionally well on the training data, the results obtained suggest that it’s a more “balanced” algorithm than Random Forest.

Here are the raw results obtained from our Gradient Boost classifier:

For More Information

To check out the entire project in detail, please click here.
