A Udacity capstone project for the Data Science Nanodegree program.
The capstone project is the last task in the Udacity Data Science Nanodegree program. I chose the Starbucks challenge for my capstone project. In a collaboration with Udacity, Starbucks provided a simulated data set that mimics customer behavior on the rewards mobile app.
Starbucks sends out offers to users of the mobile app. An offer can be purely informational or an actual promotion such as a discount or BOGO (buy one, get one free). The offers are classified into 10 offer ids, and any given offer carries one of them.
Three data files were provided and are described below:
- Portfolio — contains information about offer ids and metadata about each offer, such as duration, offer type and difficulty.
- Profile — contains demographic information for each customer, such as age, income and gender.
- Transcript — contains records for transactions, offers received, offers viewed, and offers completed.
The overall objective of this project is to generate insights from Starbucks rewards mobile app that can be useful in optimizing the business use of the application.
To achieve the project objective, Exploratory Data Analysis is used to answer the question: what is the impact of demographic and offer information on the success of an offer? In addition to the Exploratory Data Analysis, machine learning models are used to predict the success of discount and BOGO offers.
We use the accuracy score and confusion matrix to evaluate the machine learning models created to predict the outcome of an offer. The accuracy score is a suitable metric for classification problems where the classes in the dataset are well balanced. In this project the classes in the target datasets are quite well balanced; for instance, there are 12633 successful and 12683 unsuccessful discount offers.
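As a quick illustration of these two metrics, here is a minimal scikit-learn sketch; the labels below are made up for demonstration, not taken from the project data:

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Toy labels: 1 == successful offer, 0 == unsuccessful (illustrative only).
y_true = [1, 0, 1, 1, 0, 0]
y_pred = [1, 0, 0, 1, 1, 0]

acc = accuracy_score(y_true, y_pred)

# scikit-learn's convention: rows are true labels, columns are predictions,
# so cm[1, 1] counts true positives and cm[0, 1] counts false positives.
cm = confusion_matrix(y_true, y_pred)
tn, fp, fn, tp = cm.ravel()
```

Reading the confusion matrix this way is what allows the true-positive and false-positive counts quoted later in the evaluation sections.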
Exploratory Data Analysis
Gender, age and income are some of the distinguishing customer features in the profile data file. In the gender category there are 3 classes: female (F), male (M) and other (O). By grouping the profile data by gender and age or income, conclusions about the app users can be drawn.
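The grouping step can be sketched in pandas; the toy profile frame below is an illustrative stand-in, not the real data:

```python
import pandas as pd

# Illustrative stand-in for the profile data file.
profile = pd.DataFrame({
    "gender": ["M", "F", "M", "O", "F"],
    "age":    [30, 45, 50, 40, 60],
    "income": [40000, 70000, 55000, 62000, 80000],
})

# Grouping by gender gives per-class averages of age and income.
summary = profile.groupby("gender")[["age", "income"]].mean()
```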
In total there are 14825 registered users of the mobile app: 8484 male users (57.2%), 6129 female users (41.4%) and 212 of other gender (1.4%).
The distribution of ages for males and females in the dataset is very similar. The overall average, minimum and maximum ages are 62.5, 18 and 118 respectively.
The distribution of income between the male and female genders is similar. The overall average, minimum and maximum incomes are $65,405, $30,000 and $120,000 respectively.
When is an offer successful or unsuccessful?
To answer this question, the profile, portfolio and transcript datasets are preprocessed and merged into one dataframe. An offer has to go through 3 events to be successful: received, viewed and completed. For an offer to be considered a success, all 3 events must have numeric values greater than zero. Offer received always has a numeric value as long as the customer is sent an offer, so in this project an offer is considered a success when the product of the offer viewed and offer completed counts is greater than zero; otherwise it is unsuccessful.
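The success rule above can be sketched in pandas; the column names (`received`, `viewed`, `completed`) are assumed stand-ins for the event counts in the merged dataframe:

```python
import pandas as pd

# Toy aggregated frame: one row per (customer, offer), counting events.
# Column names are illustrative, not necessarily those in the notebook.
events = pd.DataFrame({
    "customer_id": ["a", "b", "c"],
    "received":  [1, 1, 1],
    "viewed":    [1, 1, 0],
    "completed": [1, 0, 1],
})

# An offer is a success only if it was received, viewed AND completed,
# i.e. the product of the viewed and completed counts is non-zero.
events["success"] = ((events["received"] > 0)
                     & (events["viewed"] * events["completed"] > 0)).astype(int)
```

Customer "a" triggers all three events and counts as a success; "b" viewed but never completed, and "c" completed without viewing, so neither counts.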
How does the offer outcome vary based on offer information?
In this project, the offer outcome can be successful or unsuccessful. To find the relationship between the offer outcome and the offer information, I perform exploratory analysis on the merged dataframe.
In the offer information category, we consider the offer outcome in relation to offer_id, duration and difficulty.
1. Offer status per offer_id
Performing a count on the offer_id column, we find that offer id 5 has the highest success count, with a success rate of 17.7%, while offer id 2 has the highest failure count, with a failure rate of 17.4%.
2. Duration and Difficulty
Offers with medium duration (6–8 days) have the highest success rate compared to those with very short (less than 4 days) or long (10 days) duration.
Offers with a moderate level of difficulty (10) achieve the highest success rate compared with those with a higher level of difficulty (20). Offers with a difficulty level of 0 achieve no successes because they are informational offers, not promotional ones.
Does the offer outcome vary with demographics?
Under this category, we consider the offer outcomes in relation to age, gender and income.
1. Offer status per gender
Offer id 5 is the most successful among male and female users, while offer id 7 is the most successful in the other-gender category. Males have a higher total count and success rate compared to females and other genders.
Offer id 3 is the most unsuccessful among female and other users, while offer id 2 is the most unsuccessful among male users. The male users have the highest count of unsuccessful offers.
2. Offer status based on income per gender.
Offers are more unsuccessful for male users with incomes below $60,000 than for female users. One can conclude that female users' offer success is less dependent on income level than male users'.
3. Offer status based on age per gender
For male users, offer success is low across all ages, while for female users there is no significant difference between successful and unsuccessful offers across ages.
In this section, we use 2 machine learning models, the Random Forest Classifier (RFC) and the Support Vector Classifier (SVC), to predict whether an offer will be successful or not using the demographic and offer information. The model performance is evaluated using confusion matrices and accuracy scores.
It should be noted that each of the 2 machine learning algorithms is used in a 2-phase process: first I fit the data to the model with default parameters and evaluate the performance, then I perform hyperparameter tuning with GridSearch, fit the best parameter combination to the data and re-evaluate the model performance.
There are 3 types of offers but since it is hard to know if a customer was influenced by the information received to transact, the focus is placed on only the discount and BOGO offers.
When working with classification models, it is very important that the classes are well balanced to avoid classification bias. In this case the successful and unsuccessful offers of the Discount type are well balanced. However, we notice a slight imbalance in the classes of the BOGO type. We do not resample the data, so that we can later assess the impact of the class imbalance on model performance.
Feature engineering
We create 2 dataframes based on the offer type column in the aggregated dataframe, selecting the rows that match the offer types Discount and BOGO. In the next step, we clean the data by interpolating the outliers and the missing values. The age column contains an outlier age (118), which we replace with the median age of the dataset. The missing values in the income column are filled with the average income of the dataframe.
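The cleaning step can be sketched as follows (a minimal sketch on toy data; 118 is the placeholder age mentioned above):

```python
import numpy as np
import pandas as pd

# Illustrative frame; 118 is the outlier age observed in the profile data.
df = pd.DataFrame({
    "age":    [25, 40, 118, 60],
    "income": [40000.0, np.nan, 75000.0, 90000.0],
})

# Replace the outlier age with the median of the remaining ages,
# and fill missing incomes with the column mean.
median_age = df.loc[df["age"] != 118, "age"].median()
df["age"] = df["age"].replace(118, median_age)
df["income"] = df["income"].fillna(df["income"].mean())
```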
Dummy variables are created for both the categorical variables (offer_id and gender) and the numeric variables (difficulty and duration).
The last step is to create the X and Y variables. The Y variable is the success column in the aggregated dataframe; the values in this column are binary (1 == successful offer, 0 == unsuccessful). The X dataset contains all columns except customer_id, year, month, success and membership_start.
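A minimal sketch of this feature-engineering step; the column values below are illustrative, and only a subset of the real columns is shown:

```python
import pandas as pd

# Illustrative subset of the aggregated dataframe.
df = pd.DataFrame({
    "offer_id":   [5, 2, 5],
    "gender":     ["F", "M", "O"],
    "difficulty": [10, 20, 10],
    "duration":   [7, 10, 7],
    "income":     [50000, 80000, 65000],
    "success":    [1, 0, 1],
})

# Dummy-encode the categorical and discrete numeric columns as described.
df = pd.get_dummies(df, columns=["offer_id", "gender",
                                 "difficulty", "duration"])

Y = df["success"]
X = df.drop(columns=["success"])
```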
We split the data using sklearn's train_test_split in a 0.7:0.3 train:test ratio. We rescale the data with the MinMaxScaler from the sklearn package.
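A sketch of the split-and-scale step on stand-in data:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Stand-in features and labels; in the project these are X and Y above.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
y = rng.integers(0, 2, size=100)

# 0.7 : 0.3 train/test split, then rescale features to [0, 1].
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

scaler = MinMaxScaler()
X_train = scaler.fit_transform(X_train)   # fit only on the training set
X_test = scaler.transform(X_test)         # reuse the training-set min/max
```

Fitting the scaler on the training set alone avoids leaking test-set statistics into training.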
Model 1: Random Forest classifier.
A random forest is a meta estimator that fits a number of decision tree classifiers on various sub-samples of the dataset and uses averaging to improve the predictive accuracy and control over-fitting.
First, I fit the base model to the X_train and Y_train features of the Discount and BOGO offers separately, then make predictions and evaluate the model performance.
The base model is set with the parameters n_estimators equal to 10, criterion set to entropy and max_features set to auto.
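The base-model step might look like the sketch below, on stand-in data from make_classification; note that newer scikit-learn versions spell max_features="auto" as "sqrt":

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split

# Stand-in data; in the project X/y come from the engineered offer frames.
X, y = make_classification(n_samples=500, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

# Base model with the parameters stated above
# ("auto" is spelled "sqrt" in newer scikit-learn).
rfc = RandomForestClassifier(n_estimators=10, criterion="entropy",
                             max_features="sqrt", random_state=0)
rfc.fit(X_train, y_train)
preds = rfc.predict(X_test)
acc = accuracy_score(y_test, preds)
cm = confusion_matrix(y_test, preds)
```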
Model 2: Random Forest classifier (+GridSearch)
In the second model, I do hyperparameter tuning using grid search to find the optimal parameters for each dataset. This is intended to improve the performance of the model. The parameters tuned include n_estimators with options 100 and 200, max_features with options auto, sqrt and log2, max_depth with a uniform range of values from 4 to 8, and criterion with options gini and entropy.
The best parameters from the GridSearch are criterion equal to entropy, max_depth equal to 8, max_features selected automatically ('auto') and n_estimators set at 200. These parameters are used to make predictions on both the Discount and BOGO offer datasets.
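The tuning step can be sketched with scikit-learn's GridSearchCV; the grid mirrors the one described above, run here on a small stand-in dataset to keep it quick ("auto" is spelled "sqrt" in newer scikit-learn):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Small stand-in data; the project runs this on the offer frames.
X, y = make_classification(n_samples=100, n_features=6, random_state=0)

param_grid = {
    "n_estimators": [100, 200],
    "max_features": ["sqrt", "log2"],
    "max_depth": list(range(4, 9)),   # uniform range 4..8
    "criterion": ["gini", "entropy"],
}
grid = GridSearchCV(RandomForestClassifier(random_state=0), param_grid, cv=3)
grid.fit(X, y)
best = grid.best_params_   # parameter combination with the best CV score
```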
Models 1 & 2 Evaluation.
When fitted to the Discount offer dataset, the base model has an accuracy score of 71%, and the grid search increases the score to 74%. The base model gives a 68% accuracy score when fitted on the BOGO offer dataset; the grid search increases the accuracy score to 72%.
There is an overall improvement in the classifier after the parameter tuning. In the Discount offer dataset, the number of true positive predictions increases from 2556 to 2746 and the false positive predictions fall from 1239 to 1049. In the BOGO offer dataset, the true positive predictions increase from 2149 to 2385 and the false positives fall from 1382 to 1146.
1.Discount offer type
2. BOGO offer type
In the Discount offer type, the membership term, i.e. how long one has been an app user, explains 25% of the success of the offer and contributes that much to the predictive accuracy of the model. This is followed by income and social media as the means of accessing the application, contributing 14% and 12% respectively.
In the BOGO offer type, income explains 30% of the success of the offer and contributes that much to the predictive accuracy of the model. This is followed by membership term and age, contributing 29% and 13% respectively.
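Rankings like these come from the fitted classifier's feature_importances_ attribute; a sketch with hypothetical feature names standing in for the engineered columns:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Hypothetical names standing in for the engineered feature columns.
names = ["membership_term", "income", "social", "age", "difficulty_10"]
X, y = make_classification(n_samples=300, n_features=5, random_state=0)

rfc = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# feature_importances_ sums to 1; sorting ranks the drivers of the prediction.
ranked = sorted(zip(names, rfc.feature_importances_),
                key=lambda p: p[1], reverse=True)
```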
Model 3: Support Vector Classifier (SVC)
One of the advantages of the SVC is that it uses a subset of the training points in the decision function (called support vectors), making it memory efficient; it is for this reason that we decided to use it.
First, we fit the base model to the X_train and Y_train features of the Discount and BOGO offers separately, then make predictions and evaluate the model performance.
When using the SVC algorithm, we use the StandardScaler from the sklearn package instead of the MinMaxScaler. The parameters of the base SVC model are gamma set to auto, kernel set to the Radial Basis Function (rbf) and the C value (the penalty on the error term) set at 1.0.
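A sketch of the base SVC step on stand-in data:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Stand-in data; the project runs this on the offer frames.
X, y = make_classification(n_samples=300, n_features=6, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

# Standardise (zero mean, unit variance) rather than min-max scale for SVC.
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Base SVC with the stated parameters.
svc = SVC(C=1.0, kernel="rbf", gamma="auto")
svc.fit(X_train, y_train)
acc = svc.score(X_test, y_test)
```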
Model 4: Support Vector Classifier (+GridSearch)
In order to improve the model performance, we use GridSearch to tune the parameters. In this experiment, we try out different combinations of the C and gamma parameters while fixing the kernel to the Radial Basis Function (rbf).
The parameters tuned during the GridSearch include the C value with options 0.1, 1, 10, 100 and 1000, and the gamma value with options 1, 0.1, 0.01, 0.001 and 0.0001; the kernel is set to the Radial Basis Function (rbf).
The best parameters from the GridSearch are a C value of 1, gamma equal to 1 and the Radial Basis Function (rbf) kernel. These parameters are used to make predictions on both the Discount and BOGO offer datasets.
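The SVC tuning step, sketched with GridSearchCV on standardised stand-in data:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Stand-in data, standardised as in the base-model step.
X, y = make_classification(n_samples=200, n_features=6, random_state=0)
X = StandardScaler().fit_transform(X)

# Grid described above; the kernel is fixed to rbf.
param_grid = {
    "C": [0.1, 1, 10, 100, 1000],
    "gamma": [1, 0.1, 0.01, 0.001, 0.0001],
    "kernel": ["rbf"],
}
grid = GridSearchCV(SVC(), param_grid, cv=3)
grid.fit(X, y)
```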
Models 3 & 4 Evaluation.
When fitted to the Discount offer dataset, the SVC base model has an accuracy score of 73%, and the grid search increases the score to 74%. The base model gives a 71% accuracy score when fitted on the BOGO offer dataset; the grid search does not change the overall model performance, but there is a slight decrease in the true positives predicted, from 2401 to 2382.
1. Discount offer type
2. BOGO offer type
Comparing the model performance plots below, we see that the hyperparameter-tuned models generally perform better than the base models, with the exception of the Support Vector Classifier models when applied to the BOGO offers dataset.
The model performance is better when working with Discount offers than with BOGO offers. This could be attributed to the class imbalance witnessed in the BOGO dataset. Resampling the dataset could help compensate for the class imbalance and bring a general improvement in the models' predictions.
When predicting offer outcomes for the Discount offer type, the Random Forest Classifier and Support Vector Classifier models with GridSearch perform equally well, with almost the same accuracy score (73.7% and 73.5% respectively). However, the SVC with GridSearch predicts a higher number of true positives than the RFC with GridSearch (2672 and 2385 respectively).
The Random Forest Classifier with GridSearch is the best performing model when predicting the offer status of the BOGO offer type (accuracy score = 72%). This is slightly better than the SVC models, which have accuracy scores of 71%.
In this capstone project, we have worked with simulated data that mimics customer behavior on the Starbucks rewards mobile app.
- We carry out exploratory data analysis to understand the distribution of the demographic information of the app users. The results show that there are more male app users than female users. The ages of the mobile app users range from 18 to 118, with the overall average age at 63. The income levels of the app users range from $30,000 to $120,000, with the overall average income equal to $65,405.
- We then merge the portfolio, profile and transcript datasets and perform further exploratory data analysis to understand how the offer and demographic information impact the outcome of the offer. The offer can be successful or unsuccessful.
i. The offer information (offer id, offer duration and difficulty) impacts the outcome of the offer. The results show that offers with offer id 5 are more likely to be successful, while offers with offer id 2 are more likely to be unsuccessful. Offers with moderate durations are more likely to be successful than those with very long durations, and offers with moderate difficulty levels are more likely to be successful than those with high difficulty levels.
ii. The demographic information (age, gender and income) impacts the outcome of the offer. Male users have a higher total count and success rate compared to female and other users. This can be explained by the fact that there are more male users than female and other users. Female users' offer success is less dependent on income level compared to male users'. Offers are more unsuccessful among male users of all ages compared to female users.
3. In the last section, we predicted the outcome of an offer type (Discount or BOGO) based on both the offer and demographic information. The results show that the term of membership, i.e. how long a customer has used the app, is the most determining factor when predicting the success of a discount offer. This could mean that the longer a customer uses the app, the more offers they receive or the more experienced they become at utilizing the offers they receive. On the other hand, income is the most determining factor when predicting BOGO offer outcomes. The Random Forest Classifier with GridSearch generally performs better than the Support Vector Classifier models when predicting the outcome of the Discount and BOGO offers.
The Starbucks rewards mobile app gives 3 types of offers: informational offers, discount offers and BOGO (buy one, get one free) offers. However, because of the complexity of assessing whether a customer's decision was influenced by the information received, we decided to focus only on the discount and BOGO offers.
Including the informational offers in the predictive model is a potential area of interest and improvement. This could help Starbucks learn which demographics and/or offer information affect the outcome of the informational offers and whether it is worth including them as part of the rewards.
The question still stands:
How do information offers impact the decisions of the Starbucks mobile app users?
The Jupyter notebook with the detailed analysis can be found here.