Starbucks — Capstone Project

Selen Gerel
10 min read · Sep 15, 2021

Udacity Data Scientist Nanodegree Program

Project Definition

Project Overview:

The simulated data available from Starbucks mimics customer behavior on the Starbucks rewards mobile app. Once every few days, Starbucks sends out an offer to users of the mobile app. An offer can be merely an advertisement for a drink or an actual offer such as a discount or BOGO (buy one get one free). Some users might not receive any offer during certain weeks. Not all users receive the same offer, and that is the challenge to solve with this particular data set. The task here is to combine transaction, demographic and offer data to determine which demographic groups respond best to which offer type. This data set is a simplified version of the real Starbucks app because the underlying simulator only has one product whereas Starbucks actually sells dozens of products.

Problem Statement:

The objective is to combine the three datasets, named portfolio, profile, and transcript, which contain data about the offers, the customers' demographics, and their transactions, into a single dataframe, and then use it to train machine learning models against a particular target variable.

Case 1: offer type as target variable

Case 2: event type as target variable

Portfolio: containing offer ids and meta data about each offer (duration, type, etc.)

Profile: demographic data for each customer

Transcript: records for transactions, offers received, offers viewed, and offers completed

Metrics:

Out of the many metrics available in the scikit-learn framework, accuracy_score was selected as the evaluation metric. Accuracy measures the fraction of all cases that are identified correctly, and it is normally used when all the classes are equally important and the dataset is reasonably balanced. The best-performing model will have an accuracy near 100%, while the worst will have an accuracy_score closer to 0. An accuracy above 90% is treated here as high enough for the model to be considered 'accurate'.
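
As a quick illustration (not part of the original notebook), this is roughly how the accuracy score is computed with scikit-learn; y_test and y_pred are placeholder names for the true and predicted labels:

```python
from sklearn.metrics import accuracy_score

# Placeholder labels: true classes vs. model predictions.
y_test = ["bogo", "discount", "bogo", "informational"]
y_pred = ["bogo", "discount", "discount", "informational"]

# Fraction of labels predicted exactly right (here 3 out of 4 = 0.75).
print(accuracy_score(y_test, y_pred))
```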

Analysis

Shapes of the datasets:

Portfolio: (10, 6)

Profile: (17000, 5)

Transcript: (306534, 4)

Merged dataframe: (167581, 25)

Each dataset has a small number of columns, and some columns are shared between pairs of datasets. This fact is used to merge the dataframes along the common columns. For example, 'profile' and 'transcript' share a 'customer_id' column, so those two dataframes can be merged along it. Similarly, 'portfolio' and 'transcript' share an 'offer_id' column; merging along it produces a single dataframe, as sketched below.
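
A minimal sketch of the merge logic on toy dataframes (the real dataframes use the cleaned and renamed columns described above):

```python
import pandas as pd

# Toy stand-ins for the cleaned dataframes, just to illustrate the merge keys.
profile = pd.DataFrame({"customer_id": [1, 2], "age": [35, 50]})
portfolio = pd.DataFrame({"offer_id": ["a", "b"],
                          "offer_type": ["bogo", "discount"]})
transcript = pd.DataFrame({"customer_id": [1, 2, 2],
                           "offer_id": ["a", "b", "a"],
                           "event": ["offer received", "offer viewed",
                                     "offer received"]})

# Merge transcript with profile on the shared customer_id column,
# then with portfolio on the shared offer_id column.
merged_df = (transcript
             .merge(profile, on="customer_id", how="left")
             .merge(portfolio, on="offer_id", how="left"))
print(merged_df.shape)
```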

Data Exploration and Visualization:

a) Income distribution

The income distribution looks like a skewed normal distribution, with most customers having an income near 70,000 per annum.

b) Gender based distribution

There is a higher proportion of male customers than female customers.

(i) Event type based on gender

(ii) Offer type based on gender

Male customers also outnumber female customers in every individual subsection.

c) Age distribution

The age distribution roughly follows a normal (bell-shaped) curve.

Methodology

The basic idea is to preprocess each dataframe separately to derive features from it, then combine the three dataframes into a single dataframe that can be fed to the model. The model then predicts the chosen target variable.

Data Preprocessing:

We need to remove or replace null values as well as infinite values to make the data clean. After that, string-type and other non-numeric variables must be converted to integers, because the model cannot process features in string format. Converting them to numbers allows them to be used in the equations that compute the cost function, which is optimized to train the model for better predictions. This is done with pandas' map function.
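
A minimal sketch of these cleaning steps on a toy dataframe (the column names and the mapping here are illustrative, not the exact ones used in the project):

```python
import numpy as np
import pandas as pd

# Toy dataframe with a missing value and an infinite value.
df = pd.DataFrame({"gender": ["M", "F", "O", None],
                   "income": [70000.0, np.inf, 55000.0, 62000.0]})

# Treat infinities as missing, then drop (or impute) the missing rows.
df = df.replace([np.inf, -np.inf], np.nan).dropna()

# Map string categories to integer codes so the model can work with them.
gender_map = {"M": 0, "F": 1, "O": 2}
df["gender"] = df["gender"].map(gender_map)
print(df)
```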

Implementation:

Preconditioning the data: Min-Max Scaling

The features can span very different ranges of values, and their absolute magnitudes have a big impact on the performance of the model. Rescaling every feature to the same interval balances the significance of the different features.
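
For example, a small sketch of min-max scaling with scikit-learn's MinMaxScaler (the feature values are made up for illustration):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Rescale each feature to the [0, 1] interval so that features with large
# absolute values (e.g. income) do not dominate small ones (e.g. age).
X = np.array([[35, 70000.0],
              [50, 55000.0],
              [28, 120000.0]])

scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(X)
print(X_scaled)
```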

Principal Component Analysis (PCA)

PCA is used to find the principal components that capture most of the variance in the data, so that we can reduce the number of features while maintaining the required variance, which makes model training more efficient.

With 16 components we were able to capture almost 99% of the variance in the data. When there are too many features, we can drop the unnecessary ones and give the model only the required components. In this case, since we only have a few more than 16 features, let's give all of them to the model and see how it performs.
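
A sketch of this step with scikit-learn's PCA, using random toy data in place of the real scaled feature matrix:

```python
import numpy as np
from sklearn.decomposition import PCA

# Toy data with 25 features, standing in for the min-max scaled merged frame.
rng = np.random.RandomState(42)
X_scaled = rng.rand(200, 25)

pca = PCA(n_components=16)
X_pca = pca.fit_transform(X_scaled)

# Cumulative share of variance captured by the 16 components
# (close to 99% on the real project data).
print(pca.explained_variance_ratio_.cumsum()[-1])
```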

Test-Train Splitting

Now the data should be split into a training set and a test set in order to validate the predictive power of the ML model. Here we perform an 80%-20% split. First we need to choose the target variable that the model will predict.
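
A minimal sketch of the split with scikit-learn's train_test_split (X and y are placeholders for the reduced features and the chosen target variable):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy features and a three-class target (like bogo/discount/informational).
rng = np.random.RandomState(42)
X = rng.rand(200, 16)
y = rng.randint(0, 3, size=200)

# 80%-20% split into training and test sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
print(X_train.shape, X_test.shape)
```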

Modelling

The models that will be trained and benchmarked for the two cases are as follows:

1) KNeighborsClassifier — Classifier implementing the k-nearest neighbors vote.

2) RandomForestClassifier — A random forest is a meta estimator that fits a number of decision tree classifiers on various sub-samples of the dataset and uses averaging to improve the predictive accuracy and control over-fitting.

3) AdaBoostClassifier — An AdaBoost classifier is a meta-estimator that begins by fitting a classifier on the original dataset and then fits additional copies of the classifier on the same dataset but where the weights of incorrectly classified instances are adjusted such that subsequent classifiers focus more on difficult cases.

4) ExtraTreesClassifier — An extra-trees classifier is a meta estimator that fits a number of randomized decision trees (a.k.a. extra-trees) on various sub-samples of the dataset and uses averaging to improve the predictive accuracy and control over-fitting.

5) BaggingClassifier — A bagging classifier is an ensemble meta-estimator that fits base classifiers on random subsets of the original dataset and then aggregates their individual predictions (either by voting or by averaging) to form a final prediction. Such a meta-estimator can typically be used to reduce the variance of a black-box estimator (e.g., a decision tree) by introducing randomization into its construction procedure and making an ensemble out of it.

Apart from the KNeighborsClassifier, default hyperparameters were used for all the models, as the data was not complex enough to warrant manual hyperparameter tuning. For the KNeighborsClassifier, the number of neighbours was set to 6. One of the difficulties while training was the presence of NaN/infinity values in the dataset, which I had missed during data cleaning. After removing those values, the training process went smoothly.
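
For reference, a sketch of how the five models could be benchmarked on toy data; the classifier setup mirrors the description above (defaults everywhere, n_neighbors=6 for KNN), but the data and the exact training loop are illustrative:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import (RandomForestClassifier, AdaBoostClassifier,
                              ExtraTreesClassifier, BaggingClassifier)
from sklearn.metrics import accuracy_score

# Toy data standing in for the PCA-reduced features and target labels.
rng = np.random.RandomState(42)
X = rng.rand(300, 16)
y = rng.randint(0, 3, size=300)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# The five classifiers compared in this project; all use default
# hyperparameters except KNeighborsClassifier (n_neighbors=6).
models = {
    "KNeighbors": KNeighborsClassifier(n_neighbors=6),
    "RandomForest": RandomForestClassifier(),
    "AdaBoost": AdaBoostClassifier(),
    "ExtraTrees": ExtraTreesClassifier(),
    "Bagging": BaggingClassifier(),
}

for name, model in models.items():
    model.fit(X_train, y_train)
    train_acc = accuracy_score(y_train, model.predict(X_train))
    test_acc = accuracy_score(y_test, model.predict(X_test))
    print(f"{name}: train={train_acc:.3f}, test={test_acc:.3f}")
```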

Refinement

With the preconditioning of the data followed by principal component analysis, I was able to capture almost 99% of the variance in the data, so I did not have to do any major tweaking of the models' hyperparameters. With the default hyperparameters alone I was able to obtain the required accuracy. Efficient preprocessing of the data probably also contributed to the smooth training of the models.

Results

Model Evaluation and Validation

Case 1: Target Variable — Offer Type

Here the model predicts offer_type using all the other features, i.e. it classifies the data into the corresponding offer types:

a) bogo

b) discount

c) informational

Training score comparison

Testing score comparison

After training the models on the training dataset, the trained models were used to make predictions on the test set. The Bagging Classifier stood out with high accuracy in both training and testing. This may be due to the averaging process inside the model, by which it takes the average of the predictions of base estimators such as decision trees.

This shows that, apart from the KNN and AdaBoost models, the other three models perform really well at predicting the type of offer. The highest test score is achieved by the Bagging Classifier, which is therefore the best at predicting the offer type. Since it has both the highest training and the highest testing accuracy, it wins out over all the other models in this case.

Case 2: Target Variable — Event type

Here the model predicts event_type using all the other features, i.e. it classifies the data into the corresponding event types:

a) offer received

b) offer viewed

c) offer completed

Training score comparison

Testing score comparison

Here, apart from the KNN model, every other model performs really well, with accuracy above 90% on the training dataset. This may be due to the effective data preprocessing that followed the exploratory data analysis. But when the models are validated on the testing dataset, only the AdaBoost model performs well, so the AdaBoost classifier is the best one for this case. The performance of the other models likely degraded due to overfitting, which could be reduced by adding noise to the data or applying regularization techniques.

The reason for the better performance of Adaboost in this case is as follows:

It fits multiple instances of the same base classifier, each ending up with different parameters because the training examples are reweighted between rounds. In this way, a set of simple linear classifiers can be combined into a nonlinear classifier. Or, as the AdaBoost authors like to put it, multiple weak learners can make one strong learner.

Conclusion

Initially the data, in the form of 3 dataframes, was processed separately using different techniques. Features were extracted from the individual dataframes, which were then cleaned to remove null values, infinite values, etc. Feature engineering was performed, deriving new columns from existing ones to make the data more readable to the model. Once the individual dataframes were cleaned and modified, they were merged into a single dataframe, which was then fed to the models.

Model training and prediction were carried out for two cases by changing the target variable, so that the models could predict two different entities separately: first the type of offer, then the type of event. Five different ML models from the scikit-learn framework were compared on the merged dataframe, and the results of the model predictions were generated to find the best-performing model and to identify the scope for improvement. The dataset could have been better with fewer null values, which I had to replace with appropriate substitutes; this likely reduced the quality of the data. Beyond that, if we had more features to work with, we could have performed PCA on a richer data pool and obtained better results.

Improvements

— Availability of more data would always help in making better predictions.

— Too many NaN values decrease the quality of the data once they are replaced or removed. This could be improved.

— Availability of more features would have improved the performance of the models.

— In some models the testing accuracy was lower than the training accuracy. This could be rectified by adding regularization techniques or adding noise to the data.

References

1) https://stackoverflow.com/

2) https://pandas.pydata.org/docs/reference/index.html

3) https://pandas.pydata.org/docs/reference/api/pandas.get_dummies.html

4) Changing datetime format: https://pandas.pydata.org/docs/reference/api/pandas.to_datetime.html

5) Merging dataframes along a column: https://pandas.pydata.org/pandas-docs/stable/user_guide/merging.html

6) One-hot encoding: https://stackabuse.com/one-hot-encoding-in-python-with-pandas-and-scikit-learn/

7) Mapping in pandas: https://kanoki.org/2019/04/06/pandas-map-dictionary-values-with-dataframe-columns/

8) Min-Max scaling: https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html

9) Metrics: https://scikit-learn.org/stable/modules/classes.html#module-sklearn.metrics

10) Factor Analyzer: https://pypi.org/project/factor-analyzer/

11) Dropping infinite values: https://statisticsglobe.com/drop-inf-from-pandas-dataframe-in-python

12) scikit-learn ensembles: https://scikit-learn.org/stable/modules/classes.html#module-sklearn.ensemble

13) https://stackoverflow.com/questions/12444716/how-do-i-set-the-figure-title-and-axes-labels-font-size-in-matplotlib

14) https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.set_index.html

15) Visualization: https://pandas.pydata.org/pandas-docs/stable/user_guide/visualization.html#visualization-barplot

16) List of lists to dataframe: https://datascience.stackexchange.com/questions/26333/convert-a-list-of-lists-into-a-pandas-dataframe
