Insights from the Starbucks rewards mobile application data.

A Udacity capstone project for the Data Science Nanodegree program.


Starbucks sends offers to users of its mobile app. An offer is either purely informational or a promotion such as a discount or BOGO (buy one, get one free). There are 10 distinct offers, each identified by an offer id.

Data Description

  1. Portfolio: the file contains the offer ids and metadata about each offer, such as duration, offer type and difficulty.
  2. Profile: the file contains demographic information for each customer, such as age, income and gender.
  3. Transcript.json: the file contains records of transactions, offers received, offers viewed and offers completed.

Project Objective

To meet the project objective, exploratory data analysis is used to answer the question: what is the impact of demographic and offer information on the success of an offer? In addition, machine learning models are used to predict the success of discount and BOGO offers.

Performance metric

Exploratory Data Analysis

In total there are 14,825 registered users of the mobile app: 8,484 male users (57.2%), 6,129 female users (41.3%) and 212 users of other gender (1.4%).

Histograms : Starbucks app users grouped by age and gender

The age distributions for males and females in the dataset are very similar. The overall average, minimum and maximum ages are 62.5, 18 and 118 respectively.

The income distributions for males and females are also similar. The overall average, minimum and maximum incomes are $65,405, $30,000 and $120,000 respectively.

When is an offer successful or unsuccessful?

Table: Aggregated data from profile, portfolio and transcript file

How does the offer outcome vary based on offer information?

In the offer information category, we consider the offer outcome in relation to offer_id, duration and difficulty.

1. Offer status per offer_id

Countplot and Tables: Offer status per offer_id

2. Duration and Difficulty

Offers with a moderate level of difficulty (10) achieve the highest success rate, ahead of those with a higher level of difficulty (20). Offers with a difficulty of 0 have no successes because they are informational offers, not promotional ones.

Distribution plots : Offer status based on offer duration and difficulty.
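The success rates behind a comparison like this can be computed with a simple group-by. The snippet below is a minimal sketch on made-up rows; the dataframe name `offers` and its column names are assumptions, not the notebook's actual names.

```python
import pandas as pd

# Made-up rows standing in for the aggregated offers table (assumed column names).
offers = pd.DataFrame({
    "difficulty": [0, 0, 10, 10, 10, 20, 20, 20],
    "duration":   [3, 4, 7, 7, 10, 10, 10, 10],
    "success":    [0, 0, 1, 1, 0, 0, 1, 0],
})

# The mean of a 0/1 column is the share of successful offers in each group.
success_by_difficulty = offers.groupby("difficulty")["success"].mean()
success_by_duration = offers.groupby("duration")["success"].mean()

print(success_by_difficulty)
print(success_by_duration)
```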

Does the offer outcome vary with demographics?

1. Offer status per gender

Table and countplot showing successful offer_ids per gender

Offer id 3 is the most unsuccessful among female and other users, while offer id 2 is the most unsuccessful among male users. Male users have the highest count of unsuccessful offers.

Table and countplot showing unsuccessful offer_ids per gender

2. Offer status based on income per gender.

Offer status for males and females based on income

3. Offer status based on age per gender

Offer status for Males and females based on age

Predictive Modelling

Two machine learning algorithms are used, each in a two-phase process. First, I fit the model to the data with default parameters and evaluate its performance. Then I tune the hyperparameters with GridSearch, refit the model with the best parameter combination and re-evaluate its performance.

There are 3 types of offers, but since it is hard to know whether a customer was influenced to transact by the information received, the focus is placed only on the discount and BOGO offers.

When working with classification models, it is very important that the classes are well balanced to avoid classification bias. In this case the successful and unsuccessful offers of the Discount type are well balanced. However, there is a slight imbalance between the classes in the BOGO type. We do not resample the data, so that we can later assess the impact of class imbalance on the performance of the model.

Visual inspection of imbalances in the data
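Besides a plot, the balance can be checked numerically. This is a minimal sketch with made-up labels; in the project the series would be the success column for one offer type.

```python
import pandas as pd

# Made-up labels (1 = successful offer, 0 = unsuccessful offer).
success = pd.Series([1, 1, 1, 0, 0, 1, 0, 1, 1, 0], name="success")

# Share of each class; values near 0.5 indicate balanced classes.
class_share = success.value_counts(normalize=True)

# Ratio of the majority class to the minority class (1.0 means perfectly balanced).
imbalance_ratio = class_share.max() / class_share.min()

print(class_share)
print(round(imbalance_ratio, 2))
```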

Feature engineering

Dummy variables are created for both the categorical variables (offer_id and gender) and the numeric variables (difficulty and duration).

The last step is to create the X and Y variables. The Y variable is the success column in the aggregated dataframe; its values are binary (1 == successful offer and 0 == unsuccessful offer). The X dataset contains all columns except customer_id, year, month, success and membership_start.

Customized function for generating features.
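As a rough illustration of this step (not the notebook's own function), the sketch below builds X and Y from a tiny made-up dataframe. The function name `make_features` and the sample rows are hypothetical; the column names follow the description above.

```python
import pandas as pd

# Tiny made-up slice of the aggregated dataframe.
df = pd.DataFrame({
    "customer_id": ["a", "b", "c", "d"],
    "offer_id": [1, 2, 1, 3],
    "gender": ["M", "F", "O", "F"],
    "difficulty": [10, 20, 10, 5],
    "duration": [7, 10, 7, 5],
    "age": [35, 52, 41, 67],
    "income": [48000, 72000, 55000, 91000],
    "year": [2017, 2016, 2018, 2015],
    "month": [3, 11, 6, 1],
    "membership_start": pd.to_datetime(
        ["2017-03-01", "2016-11-15", "2018-06-20", "2015-01-05"]
    ),
    "success": [1, 0, 1, 0],
})


def make_features(frame):
    """Hypothetical helper: return (X, Y) for one offer type."""
    # Target: 1 for a successful offer, 0 for an unsuccessful one.
    Y = frame["success"]
    # Drop the target and the columns excluded from the feature set.
    X = frame.drop(
        columns=["customer_id", "year", "month", "success", "membership_start"]
    )
    # One-hot encode the categorical and the discrete numeric columns.
    X = pd.get_dummies(X, columns=["offer_id", "gender", "difficulty", "duration"])
    return X, Y


X, Y = make_features(df)
print(X.columns.tolist())
```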

Data preprocessing

Data splits and rescaling
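A minimal sketch of the split and rescaling on synthetic data. The 75/25 split, the stratification and the random seed are assumptions; the MinMaxScaler is the scaler the article mentions for the Random Forest models.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for the feature matrix and the binary target.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
Y = rng.integers(0, 2, size=200)

# Assumed 75/25 split, stratified so both sets keep the same class shares.
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.25, random_state=42, stratify=Y
)

# Fit the scaler on the training data only, then apply it to the test data,
# so no information from the test set leaks into training.
scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
```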

Model 1: Random Forest classifier.

First, I fit the base model to the X_train and Y_train data of the Discount and BOGO offers separately, then make predictions and evaluate the model performance.

The base model is set with the following parameters: n_estimators is 10, criterion is entropy and max_features is auto.

RFC base model
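A minimal sketch of a base model with these settings, fitted on synthetic data. One caveat: recent scikit-learn releases no longer accept max_features="auto" for classifiers, so the sketch uses "sqrt", which is what "auto" meant for a RandomForestClassifier. The data, split and seeds are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for one offer type's features and success labels.
X, Y = make_classification(n_samples=500, n_features=8, random_state=0)
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.25, random_state=0
)

# Base model: 10 trees, entropy criterion, sqrt(n_features) candidates per split.
rfc = RandomForestClassifier(
    n_estimators=10, criterion="entropy", max_features="sqrt", random_state=0
)
rfc.fit(X_train, Y_train)

# Predict on the held-out data and score the predictions.
Y_pred = rfc.predict(X_test)
accuracy = accuracy_score(Y_test, Y_pred)
print(f"accuracy: {accuracy:.3f}")
```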

Model 2: Random Forest classifier (+GridSearch)

RFC with GridSearch.

The best parameters from the GridSearch are criterion = entropy, max_depth = 8, max_features = ‘auto’ and n_estimators = 200. These parameters are used to make predictions on both the discount and BOGO offer datasets.

RFC+GridSearch: Best parameters
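The article does not list the grid that was searched, so the sketch below uses a small hypothetical grid that merely includes the reported best values. It runs on synthetic data with an assumed 3-fold cross-validation.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for one offer type's data.
X, Y = make_classification(n_samples=300, n_features=8, random_state=0)
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.25, random_state=0
)

# Hypothetical grid; only the reported best values are taken from the article.
param_grid = {
    "n_estimators": [50, 200],
    "max_depth": [4, 8],
    "criterion": ["gini", "entropy"],
}

grid = GridSearchCV(
    RandomForestClassifier(max_features="sqrt", random_state=0),
    param_grid,
    cv=3,                # assumed number of folds
    scoring="accuracy",
)
grid.fit(X_train, Y_train)

# After fitting, the search object predicts with the refitted best estimator.
print(grid.best_params_)
test_accuracy = grid.score(X_test, Y_test)
print(f"test accuracy: {test_accuracy:.3f}")
```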

Models 1 & 2 Evaluation.

There is an overall improvement in the classifier after the parameter tuning. In the discount offer type there is an increase in the number of true positive predictions from 2556 to 2746 and a reduction in the false positive predictions from 1239 to 1049. In the BOGO offer type data set, the true positive predictions increase from 2149 to 2385 and the false positives reduce from 1382 to 1146.

1. Discount offer type

Discount offer : Confusion matrix RFC and RFC+GridSearch

2. BOGO offer type

BOGO offer : Confusion matrix RFC and RFC+GridSearch
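For readers who want to reproduce such counts, here is a minimal sketch of reading the four cells of a binary confusion matrix with scikit-learn. The labels are made up and unrelated to the project's results.

```python
from sklearn.metrics import confusion_matrix

# Made-up true labels and predictions (1 = successful offer).
y_true = [1, 1, 1, 1, 0, 0, 0, 1, 0, 1]
y_pred = [1, 0, 1, 1, 0, 1, 0, 1, 0, 0]

# For binary labels, scikit-learn returns [[tn, fp], [fn, tp]]:
# rows are the actual classes, columns are the predicted classes.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)
print(tn, fp, fn, tp, accuracy)
```

On a fixed test set, tp + fn always equals the number of actual positives, so a gain in true positives shows up as an equal drop in false negatives.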

Feature importance.

In the BOGO offer type, income has the largest feature importance (30%), meaning it contributes the most to the model's predictions. It is followed by membership term and age, contributing 29% and 13% respectively.

Features importance: RFC+GridSearch
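A minimal sketch of how such importances can be read from a fitted Random Forest. The data is synthetic and the feature names are placeholders attached to random columns, so the printed values say nothing about the real dataset.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Placeholder names attached to synthetic columns, for illustration only.
feature_names = ["income", "membership_term", "age", "difficulty", "duration"]
X, Y = make_classification(
    n_samples=400, n_features=5, n_informative=3, n_redundant=0, random_state=0
)

rfc = RandomForestClassifier(
    n_estimators=200, criterion="entropy", max_depth=8, random_state=0
).fit(X, Y)

# Impurity-based importances: one value per feature, summing to 1.
importances = pd.Series(
    rfc.feature_importances_, index=feature_names
).sort_values(ascending=False)
print(importances)
```

These values are relative shares of the impurity reduction across the forest's splits, so they rank the features within the model rather than measure how much of the outcome each feature explains.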

Model 3: Support Vector Classifier (SVC)

First, we fit the base model to the X_train and Y_train data of the Discount and BOGO offers separately, then make predictions and evaluate the model performance.

When using the SVC algorithm, we use the StandardScaler from the scikit-learn package instead of the MinMaxScaler. The parameters in the base SVC model are defined as follows: gamma is set to auto, the kernel is the Radial Basis Function (rbf) and the C value (the penalty on the error term) is set to 1.0.

SVC: Base model (Discount & BOGO)
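A minimal sketch of this base model on synthetic data, with the StandardScaler fitted on the training set only. The data, split and seeds are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for one offer type's data.
X, Y = make_classification(n_samples=400, n_features=8, random_state=0)
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.25, random_state=0
)

# Standardise to zero mean and unit variance using training statistics only.
scaler = StandardScaler()
X_train_std = scaler.fit_transform(X_train)
X_test_std = scaler.transform(X_test)

# Base SVC: RBF kernel, C = 1.0, gamma="auto" (that is, 1 / n_features).
svc = SVC(C=1.0, kernel="rbf", gamma="auto")
svc.fit(X_train_std, Y_train)

accuracy = accuracy_score(Y_test, svc.predict(X_test_std))
print(f"accuracy: {accuracy:.3f}")
```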

Model 4: Support Vector Classifier (+GridSearch)

The parameters tuned during the GridSearch are the C value, with options 0.1, 1, 10, 100 and 1000, and the gamma value, with options 1, 0.1, 0.01, 0.001 and 0.0001. The kernel is set to the Radial Basis Function (rbf).

SVC with GridSearch.
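A minimal sketch of this search on synthetic data, assuming the intended grid is C in {0.1, 1, 10, 100, 1000} and gamma in {1, 0.1, 0.01, 0.001, 0.0001}. The 3-fold cross-validation and the data are assumptions as well.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for one offer type's data.
X, Y = make_classification(n_samples=200, n_features=6, random_state=0)
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.25, random_state=0
)

scaler = StandardScaler()
X_train_std = scaler.fit_transform(X_train)
X_test_std = scaler.transform(X_test)

# Grid as read from the article; the kernel is fixed to RBF.
param_grid = {
    "C": [0.1, 1, 10, 100, 1000],
    "gamma": [1, 0.1, 0.01, 0.001, 0.0001],
    "kernel": ["rbf"],
}

grid = GridSearchCV(SVC(), param_grid, cv=3, scoring="accuracy")
grid.fit(X_train_std, Y_train)

print(grid.best_params_)
test_accuracy = grid.score(X_test_std, Y_test)
print(f"test accuracy: {test_accuracy:.3f}")
```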

The best parameters from the GridSearch are C = 1, gamma = 1 and the Radial Basis Function (rbf) kernel. These parameters are used to make predictions on both the discount and BOGO offer datasets.

GridSearch: Best parameters (Discount and BOGO)

Models 3 & 4 Evaluation.

1. Discount offer type

Discount offer: Confusion matrix SVC and SVC+GridSearch

2. BOGO offer type

BOGO offer : Confusion matrix SVC and SVC+GridSearch

Model performance.

The models perform better on Discount offers than on BOGO offers. This could be attributed to the class imbalance observed in the BOGO dataset. Resampling the dataset could help to compensate for the class imbalance and improve the models' predictions overall.

When predicting offer outcomes in the Discount offer type, the Random Forest Classifier and Support Vector Classifier models with GridSearch perform equally well, with almost the same accuracy score (73.7% and 73.5% respectively). However, the SVC with GridSearch predicts a higher number of true positives than the RFC with GridSearch (2672 and 2385 respectively).

The Random Forest Classifier with GridSearch is the best-performing model when predicting the offer status of the BOGO offer type (accuracy score = 72%). This is slightly better than the SVC models, which have accuracy scores of 71%.

Summary model performance: Discount and BOGO offer


  1. We carried out exploratory data analysis to understand the distribution of the demographic information of the app users. The results show that there are more male app users than female users. The ages of the mobile app users range from 18 to 118, with an overall average age of 63. The income levels of the app users range from $30,000 to $120,000, with an overall average income of $65,405.
  2. We then merged the portfolio, profile and transcript datasets and performed further exploratory data analysis to understand how the offer and demographic information impact the outcome of the offer. The offer can be successful or unsuccessful.

i. The offer information (offer id, offer duration and difficulty) impacts the outcome of the offer. The results show that offers with offer id 5 are more likely to be successful, while offers with offer id 2 are more likely to be unsuccessful. Offers with moderate durations are more likely to be successful than those with very long durations. Offers with moderate difficulty levels are more likely to be successful than those with high difficulty levels.

ii. The demographic information (age, gender and income) impacts the outcome of the offer. Male users have a higher total count of successful offers than female users and other users. This can be explained by the fact that there are more male users than female users and others. The female users’ offer success is less dependent on income level than that of male users. Offers are more often unsuccessful among male users of all ages than among female users.

3. In the last section, we predicted the outcome of the Discount and BOGO offer types based on both the offer and demographic information. The results show that the term of membership, i.e. how long a customer has used the app, is the most determining factor when predicting the success of a discount offer. This could mean that the longer a customer uses the app, the more offers they receive or the more experience they gain in using the offers they receive. On the other hand, income is the most determining factor when predicting BOGO offer outcomes. The Random Forest Classifier with GridSearch generally performs better than the Support Vector Classifier models when predicting the outcome of the Discount and BOGO offers.


Including the informational offers in the predictive model is a potential area of interest and improvement. This could help Starbucks to learn which demographic and/or offer information affects the outcome of the informational offers, and whether it is worth including them as part of the rewards.

The question still stands:

How do informational offers impact the decisions of the Starbucks mobile app users?

The Jupyter notebook with the detailed analysis can be found here.