Predicting offer-completeness on Starbucks offers

Raja Kyadhari
13 min read · Jul 1, 2021

References:

https://www.businessofapps.com/ads/in-app/

https://machinelearningmastery.com/

https://towardsdatascience.com/

Project Overview:

Personalization and a reduction in the intrusiveness of advertising messages help to improve customer retention, maximize marketing efficiency, and improve the return on investment (ROI).

Every advertiser’s goal is peak relevance. And the way to relevance is segmentation. The narrower your audience segments become, the closer you get to delivering the 1:1 personalization that customers crave.

Along similar lines, Starbucks, in the simulated dataset provided by Udacity, targets its audience and customers through various kinds of offers, namely 'bogo' (buy one, get one free), 'discount', and 'informational' offers.

The channels used to reach customers are also multiple, namely: web, email, social media, and mobile.

Starbucks and Udacity have provided a simplified version of the Starbucks app's dataset for this Capstone Project.

Problem Statement:

With the given datasets and background information provided, I am interested in exploring the following with the Starbucks dataset for this Capstone project:

  • Predict the response to an offer: will the consumer complete an offer or not?
  • Measure the accuracy, precision, and F1-score of the model
  • Explore which input parameters/features play a critical role in predicting whether the customer will take the offer or not

Data files:

The data is contained in three files:

  • portfolio.json — containing offer ids and metadata about each offer (duration, type, etc.)
  • profile.json — demographic data for each customer
  • transcript.json — records for transactions, offers received, offers viewed, and offers completed
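As a sketch of the loading step, the three files can be read with pandas. In the project you would pass file paths such as 'data/portfolio.json' (the path is an assumption about your workspace); here the mechanics are demonstrated on a tiny inline sample shaped like portfolio.json:

```python
import io
import pandas as pd

def load_json_lines(path_or_buf):
    """Load a line-delimited JSON file into a DataFrame."""
    return pd.read_json(path_or_buf, orient='records', lines=True)

# In the project: portfolio = load_json_lines('data/portfolio.json'), etc.
# Demonstrated here on a hypothetical two-row sample:
sample = io.StringIO(
    '{"id": "offer_a", "offer_type": "bogo", "difficulty": 10, "duration": 7, "reward": 10}\n'
    '{"id": "offer_b", "offer_type": "discount", "difficulty": 7, "duration": 7, "reward": 3}\n'
)
portfolio = load_json_lines(sample)
print(portfolio.shape)           # dimensions of the dataset
print(portfolio.isnull().sum())  # null counts per column
```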

Project Design

  • Loading and exploring the data: The datasets will be loaded, and some exploratory analysis will be performed at this stage, such as reviewing the distribution of the data, the number and type of features, the dimensions of the dataset, a statistical summary of attributes, and data visualization with histograms on the input variables to get an idea of their distribution
  • Data preprocessing and feature engineering: Data cleansing and feature engineering will be performed at this stage. From the transaction dataset, I intend to create additional attributes based on the existing ones to study their impact on offer completion. Examples could include how long the user has been a member, the number of offers received by a user, the number of times an offer converted to a purchase, and so on
  • Splitting the data into train/test sets: We will split the data into training and test datasets and derive the train/test features and labels
  • Modeling: Defining and training Decision Tree and Random Forest binary classifiers
  • Making improvements to the model: This will be an incremental step. We will adjust the input parameters/features that feed the model to improve the accuracy rate as applicable, and perform model tuning to optimize the metrics we are interested in
  • Evaluating and comparing model test performance: We will measure the accuracy, precision, recall, and F1-score for both models and suggest the better model for predicting whether an offer will be completed

Metrics

As described in the Udacity course module, precision and recall are different metrics for measuring the “success” or performance of a trained model.

  • Precision is defined as the number of true positives (offers correctly predicted as completed, in this case) over all predicted positives, and will be higher when the number of false positives is low
  • Recall is defined as the number of true positives over true positives plus false negatives, and will be higher when the number of false negatives is low. Both metrics consider true positives, so both benefit from accurate positive predictions
  • F1-score is the harmonic mean of precision and recall
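These three metrics can be computed directly with scikit-learn. The labels below are a toy illustration (1 = offer completed, 0 = not completed), not the project's data:

```python
from sklearn.metrics import precision_score, recall_score, f1_score, accuracy_score

# Toy example labels: 1 = offer completed, 0 = not completed
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0, 1, 1, 0]

# precision = TP / (TP + FP); recall = TP / (TP + FN)
# f1 = 2 * precision * recall / (precision + recall)
print('accuracy :', accuracy_score(y_true, y_pred))
print('precision:', precision_score(y_true, y_pred))
print('recall   :', recall_score(y_true, y_pred))
print('f1       :', f1_score(y_true, y_pred))
```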

Data Exploration

  1. portfolio.json — contains offer ids and meta data about each offer (duration, type, etc.)
portfolio data

Statistics:

The file contains 10 entries, and no null values are present in any of the columns.

  • id indicates the ‘offer id’ associated with each offer type
  • The file contains details on the different offer types, namely ‘bogo’, ‘discount’, and ‘informational’
  • There is a channels column indicating how each of the offer types is marketed
  • The potential features from this dataset, such as ‘difficulty’, ‘duration’, and ‘reward’, are on different scales, and we will need to normalize these values by applying a scaler before feeding the models
  • The channels column is categorical. It needs to be converted by applying one-hot encoding before passing it as a feature to the model
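A minimal sketch of these two preprocessing steps follows; the mini portfolio frame is hypothetical, but the column names mirror portfolio.json:

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Hypothetical mini portfolio; column names follow portfolio.json
portfolio = pd.DataFrame({
    'id': ['offer_a', 'offer_b'],
    'channels': [['web', 'email', 'mobile'], ['email', 'social']],
    'difficulty': [10, 7],
    'duration': [7, 10],
    'reward': [10, 3],
})

# One-hot encode the list-valued channels column: one indicator per channel
channel_dummies = portfolio['channels'].explode().str.get_dummies().groupby(level=0).max()
portfolio = portfolio.drop(columns='channels').join(channel_dummies)

# Bring difficulty, duration, and reward onto a common [0, 1] scale
scaler = MinMaxScaler()
portfolio[['difficulty', 'duration', 'reward']] = scaler.fit_transform(
    portfolio[['difficulty', 'duration', 'reward']])
print(portfolio)
```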

2. profile.json — demographic data for each customer

Gender is categorical in nature, and we will convert it to indicator columns after cleaning using the one-hot encoding technique

  • We will convert the became_member_on column into a more meaningful feature such as ‘member_since’, which will hold the number of days the customer has been a member
Missing values
  • Missing values in the gender and income columns correspond to the value 118 in the age column
  • There are several recommended ways to handle missing values in a table:
  • Determine if they are outliers and can potentially be dropped from the dataset
  • If dropping the missing records is not a viable option, consider filling in the missing values with an appropriate technique, such as the mean or the mode of the column. In this case, as these potential customers could have made valid transactions, I decided to fill the missing income values with the mean of the income column and fill the gender column with the mode, i.e., the most frequently appearing entry in the dataset
  • The value of 118 in the age column does not seem accurate either, and appears to be entered in cases where the customer has not provided sufficient information. For records with an age of 118, I decided to substitute the mean age so that entries fall in a reasonable range
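The cleaning steps above might look like the following sketch; the mini profile frame and the snapshot date used for ‘member_since’ are assumptions for illustration:

```python
import pandas as pd

# Hypothetical mini profile; columns follow profile.json
profile = pd.DataFrame({
    'gender': ['F', None, 'M', 'M'],
    'age': [55, 118, 33, 40],
    'income': [72000.0, None, 45000.0, 60000.0],
    'became_member_on': [20170715, 20180101, 20160512, 20171120],
})

# Feature engineering: days of membership relative to a fixed reference date
joined = pd.to_datetime(profile['became_member_on'], format='%Y%m%d')
reference = pd.Timestamp('2018-07-26')  # assumed snapshot date; pick any fixed date
profile['member_since'] = (reference - joined).dt.days

# Impute: income -> column mean, gender -> column mode, age == 118 -> mean of valid ages
profile['income'] = profile['income'].fillna(profile['income'].mean())
profile['gender'] = profile['gender'].fillna(profile['gender'].mode()[0])
profile['age'] = profile['age'].astype(float)
valid_age_mean = profile.loc[profile['age'] != 118, 'age'].mean()
profile.loc[profile['age'] == 118, 'age'] = valid_age_mean
```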

3.transcript.json — records for transactions, offers received, offers viewed, and offers completed

Transaction events are associated with an amount in the value column, not an offer id

• The offer id is embedded in the value column (under the keys ‘offer id’ and ‘offer_id’, depending on the event). As part of cleanup, we will extract it into a separate offer_id column

• By definition, a customer is said to have completed an offer if the following sequence occurs: offer received -> offer viewed -> offer completed. This applies specifically to bogo (buy one, get one) and discount offers, and not to informational offers

Joining the transcript dataset with the portfolio and profile datasets provides more meaningful insights into the data. There are no null values in the dataset.
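The extraction described above can be sketched as follows; the mini transcript frame is hypothetical, but the mixed 'offer id' / 'offer_id' keys in the value column mirror the real file:

```python
import pandas as pd

# Hypothetical mini transcript; the value column holds a dict whose key varies
transcript = pd.DataFrame({
    'person': ['p1', 'p1', 'p1', 'p2'],
    'event': ['offer received', 'offer viewed', 'offer completed', 'transaction'],
    'value': [{'offer id': 'offer_a'}, {'offer id': 'offer_a'},
              {'offer_id': 'offer_a', 'reward': 10}, {'amount': 12.5}],
    'time': [0, 6, 120, 132],
})

# The key is 'offer id' for received/viewed events and 'offer_id' for
# completed events; normalize both into a single offer_id column
transcript['offer_id'] = transcript['value'].apply(
    lambda v: v.get('offer id', v.get('offer_id')))
# Transactions carry an amount instead of an offer id
transcript['amount'] = transcript['value'].apply(lambda v: v.get('amount'))
print(transcript[['event', 'offer_id', 'amount']])
```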

Data exploration and visualization

Offers from portfolio dataset
profile data by income
profile data by age

Observations:

  • More males than females seem to have completed an offer
  • People in the age group 49–69 seem to have completed offers most often
  • People in the 60–70k income bucket seem to have completed offers the most
  • These are interesting observations, as we can validate how they play out when the feature importances are derived

Algorithms and Techniques:

Since we are dealing with a classification problem, I decided to use a simple logistic regression model as the benchmark. The other two algorithms used for classification are Decision Trees and Random Forests. Decision trees are among the most commonly used predictive modeling algorithms in classification applications, and they come with certain distinct advantages that make them a good choice for classification tasks.

The ability to precisely classify observations is extremely valuable for business applications such as predicting whether a particular user will buy a product. Random forest sits near the top of the classifier hierarchy: a random forest (RF) is an ensemble classification scheme that uses a majority vote across multiple decision trees, each trained on a partition of the data, to predict classes.

Confusion matrix for the combined dataset of BOGO and discount type offers

Parameters used with the Decision Tree classifier: DecisionTreeClassifier(criterion='entropy', max_depth=5, random_state=2, min_samples_split=90, min_samples_leaf=50) #DecisionTree

  • max_depth: int or None, optional (default=None) The maximum depth of the tree. If None, then nodes are expanded until all leaves are pure or until all leaves contain less than min_samples_split samples
  • min_samples_split: int, float, optional (default=2) The minimum number of samples required to split an internal node
  • min_samples_leaf: int, float, optional (default=1) The minimum number of samples required to be at a leaf node
  • min_weight_fraction_leaf: float, optional (default=0.) The minimum weighted fraction of the sum total of weights (of all the input samples) required to be at a leaf node
  • max_features: int, float, string or None, optional (default=None) The number of features to consider when looking for the best split
  • random_state: int, RandomState instance or None, optional (default=None) If int, random_state is the seed used by the random number generator; if RandomState instance, random_state is the random number generator; if None, the random number generator is the RandomState instance used by np.random

Parameters used with the Random Forest classifier: RandomForestClassifier(random_state=2, max_depth=11, max_features='auto', min_samples_split=10, n_estimators=20, min_samples_leaf=20)

#EnsembleMethods

  • n_estimators = number of trees in the forest
  • max_features = max number of features considered for splitting a node
  • max_depth = max number of levels in each decision tree
  • min_samples_split = min number of data points placed in a node before the node is split
  • min_samples_leaf = min number of data points allowed in a leaf node
  • bootstrap = method for sampling data points (with or without replacement)
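Putting the two classifiers together, a training sketch using the parameters listed above might look like this. The features and labels are synthetic stand-ins for the engineered dataset, and I omit max_features='auto', which newer scikit-learn versions no longer accept for RandomForestClassifier:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in features/labels; in the project X and y come from the
# engineered, merged dataset
rng = np.random.default_rng(2)
X = rng.normal(size=(1000, 6))
y = (X[:, 0] + X[:, 1] > 0).astype(int)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=2)

dt = DecisionTreeClassifier(criterion='entropy', max_depth=5, random_state=2,
                            min_samples_split=90, min_samples_leaf=50)
rf = RandomForestClassifier(random_state=2, max_depth=11, min_samples_split=10,
                            n_estimators=20, min_samples_leaf=20)

# Fit each classifier and report held-out accuracy
for name, model in [('decision tree', dt), ('random forest', rf)]:
    model.fit(X_train, y_train)
    print(name, 'test accuracy:', model.score(X_test, y_test))
```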

Gradient descent-based algorithms such as logistic regression are sensitive to the scaling of features. To ensure that gradient descent converges smoothly, it is critical to scale the features. Tree-based algorithms such as Decision Trees, and ensemble models such as Random Forests, are invariant to feature scaling, and their performance is not much impacted when this technique is applied.

Benchmark

Refinement

  • I considered using 3 separate datasets for testing the model’s performance against different test samples
  • I also tried taking age and income as range buckets. That way they are not necessarily normalized. The accuracy and f1-score dropped by about 1% across the board
  • Usage of Minmax scaler or standard scaler did not impact the performance of the logistic regression model
  • Finally, I also tried using Grid search to optimize the parameters and fine tune the model
  • The Random Forest classifier turned out to be the best classifier for our project: RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini', max_depth=None, max_features='auto', max_leaf_nodes=None, min_impurity_decrease=0.0, min_impurity_split=None, min_samples_leaf=10, min_samples_split=2, min_weight_fraction_leaf=0.0, n_estimators=20, n_jobs=1, oob_score=False, random_state=42, verbose=0, warm_start=False)
Optimized model results using Random Forest classifier
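A grid search along these lines could be used for the tuning step. The grid and the synthetic data here are illustrative, not the exact search that produced the parameters above:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data; the actual search ran on the combined
# bogo + discount dataset
rng = np.random.default_rng(42)
X = rng.normal(size=(600, 5))
y = (X[:, 0] - X[:, 2] > 0).astype(int)

# Hypothetical grid around the parameters reported above
param_grid = {
    'n_estimators': [10, 20],
    'min_samples_leaf': [10, 20],
    'min_samples_split': [2, 10],
}
# Optimize for f1 rather than plain accuracy, per the metrics section
search = GridSearchCV(RandomForestClassifier(random_state=42),
                      param_grid, scoring='f1', cv=3)
search.fit(X, y)
print(search.best_params_)
print('best cv f1:', search.best_score_)
```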

Results

Model Evaluation and Validation:

We will test the data against 3 different models and evaluate the performance

  • Logistic Regression is our benchmark model. We have considered Decision Tree and Random Forest classifiers as the other 2 models to predict whether the customer will complete the offer or not
  • 3 datasets are considered — BOGO only, Discount only and a combination of both to gauge performance of the model
  • For this project, I have not considered the informational offers as they do not have a clear set path to measure offer completeness
  • A confusion matrix is plotted for each of the datasets, giving us an idea of how the model performs with respect to false positives and false negatives (precision and recall), which are important metrics for evaluating the models beyond accuracy
  • Finally, we apply model tuning to optimize the model. In this case, I applied the tuning to the combination dataset (bogo + discount)
  • Decision Tree and Random Forest classifiers perform better than the benchmark model Logistic Regression
  • For the bogo dataset, we had 5090 test samples: the decision tree and random forest produce about 70% accuracy and 70% f1-score, whereas logistic regression is around 64% accuracy and 62% f1-score
  • For the discount dataset, we had 4286 test samples: the decision tree and random forest produce about 69% accuracy and 69% f1-score, whereas logistic regression is around 66% for both
  • For the combined dataset, we had 9376 test samples: the decision tree and random forest produce about 70% accuracy and 70% f1-score, whereas logistic regression is around 64% for both
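The confusion matrices mentioned above can be produced with scikit-learn; the labels here are a toy example, not the project's predictions:

```python
from sklearn.metrics import confusion_matrix

# Toy labels: 1 = offer completed, 0 = not completed
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

# Rows are true classes, columns predicted: [[TN, FP], [FN, TP]]
cm = confusion_matrix(y_true, y_pred)
tn, fp, fn, tp = cm.ravel()
print(cm)
print('false positives:', fp, 'false negatives:', fn)
```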

Based on these results, we can recommend either the Decision Tree or the Random Forest classifier. Both have decent accuracy, precision, recall, and f1-scores.

Decent precision and recall also indicate low false positives and low false negatives, which means that when the model predicts that an offer will be completed based on the features, the prediction stands a good chance of being true, and the same holds when the model predicts that an offer will not be completed.

As noted earlier, decision trees are one of the most commonly used predictive modeling algorithms, and they come with certain distinct advantages that make them a good choice for classification tasks:

Many algorithms require scale normalization before model building. Such variable transformations are not required with decision trees, because the tree structure remains the same with or without the transformation. This comes in handy when dealing with a large number of features on different scales/magnitudes.

Decision trees are also not sensitive to outliers, since partitioning happens based on the proportion of samples within the split ranges and not on absolute values. The only caveat they present is that, without limiting tree growth, they tend to overfit the training data.

One way to mitigate overfitting is to update the model incrementally by splitting the dataset into smaller subsets.

As for random forests: the ensemble technique combines the indications from multiple trained classifiers to classify new instances. The reason random forest classifiers perform well is that a large number of relatively uncorrelated models (trees) operating as a committee will outperform any of the individual constituent models. While some trees may be wrong, many other trees will be right, so as a group the trees are able to move in the correct direction. Random forests can handle large numbers of variables in a dataset, and during the forest-building process they generate an internal unbiased estimate of the generalization error. In addition, they can estimate missing data well. A major drawback of random forests is the lack of reproducibility, as the process of building the forest is random.

To test the robustness of the models, I tried against 3 sets of data:

  1. BOGO only dataset
  2. Discount only dataset
  3. Combination of both BOGO and discount datasets.
Results for BOGO only dataset
Results for discount only dataset
Results for combined (BOGO + discount) datasets

Justification

Benchmark model: Logistic Regression results

Results of Logistic Regression model

Model 1 results: Decision Tree Classifier

Results of Decision Tree Classifier

Model 2 results: Random Forest Classifier

Results of Random Forest Classifier

Clearly, both the Decision Tree based model and the Random Forest based model performed better than our benchmark Logistic Regression model. As stated in the metrics section, precision, recall, and f1-score are measures that validate the performance of the model apart from accuracy. Both the Decision Tree and the Random Forest had decent precision, recall, and f1-scores of about 70%, with accuracy also around 70%. This shows that the models report a low number of false positives and false negatives, which means they are reliable in predicting whether a customer would truly complete an offer or not.

Conclusion

Features Importance:

Reflection

  • By plotting the feature importances, we find that ‘member_since’, ‘income’, and ‘age’ are the top 3 features with the highest weight. ‘member_since’ was a new column added as part of feature engineering, based on the ‘became_member_on’ date column
  • After evaluating the model results and looking at the feature importances, I am convinced that the models suggested in this project can be used to predict whether an offer would be completed or not
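A feature-importance ranking of this kind can be reproduced along the following lines. The data and fitted forest here are synthetic stand-ins, with the project's top feature names used only as illustrative labels:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in: in the project, the columns are the engineered features
feature_names = ['member_since', 'income', 'age', 'difficulty', 'reward']
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))
# Make the first feature the dominant signal for illustration
y = (X[:, 0] + 0.3 * X[:, 1] > 0).astype(int)

rf = RandomForestClassifier(n_estimators=20, random_state=0).fit(X, y)
# feature_importances_ sums to 1; higher means more weight in the splits
importances = pd.Series(rf.feature_importances_, index=feature_names)
print(importances.sort_values(ascending=False))
```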

I started this project with certain goals as stated in the problem statement of my proposal:

  • Predict the response to an offer: will the consumer complete an offer or not?
  • Measure the accuracy of the model, precision, and F1-Score
  • Explore which input parameters/features play a critical role in predicting whether the customer will take the offer or not.

I believe the questions above have been answered to my satisfaction, with compelling statistics and results.

The Udacity Data Scientist Nanodegree program has been a rewarding experience, with hands-on work on well-laid-out projects.

Improvement

  • I have only considered the BOGO and discount datasets for this project, as they have a clear-cut way to measure offer completeness
  • As an improvement, I would also like to consider the informational offer types and see how that works against the suggested models
  • I would also be interested to test other algorithms to see if there is any variance in the performance metrics reported
