How I built a simple AI model to convert more sales on Shopify
Have you ever thought about leveraging your existing e-commerce data to identify visitors with a high intention to purchase on your online store?
Doing so is relatively easy and will enable you to better prioritize your marketing efforts towards visitors likely to convert, producing a superior return on marketing investment and, put simply, growing your sales. In this post I highlight the steps I implemented to get 91% accuracy using basic site logs, Google Analytics data, Python and scikit-learn models. This can be implemented in less than a day; just reach out to me directly if you prefer to have someone do it for you.
We will run through the following steps to set this up:
Data collection and cleaning
Import libraries and pre-process data
Set up a pipeline with our model, then train & test
Final model selection and tuning with hyperparameters
Put into production
To step through this process I used analytics data from a UCI Machine Learning repository dataset rather than expose sensitive data from my clients.
You can access the notebook used in this demo on github: https://github.com/sjt80/ecommerce-customer-intention
1. Collect data
Regardless of what you are selling, you will need to track your site visitors and their session activity (categories/sections viewed, pages viewed, time spent on each, whether they are a new or returning visitor, etc.; the more data the better) and, finally but most importantly, record whether they made a purchase during their session (in our case our target is “Revenue”, a simple true or false value). Sourcing the data can easily be done with Google Analytics by pasting a line of code into your site’s header. GA is free and the industry norm, and there are excellent tutorials on how to set it up in under an hour.
The dataset we are using looks like this:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12330 entries, 0 to 12329
Data columns (total 18 columns):
 #   Column                    Non-Null Count  Dtype
---  ------                    --------------  -----
 0   Administrative            12330 non-null  int64
 1   Administrative_Duration   12330 non-null  float64
 2   Informational             12330 non-null  int64
 3   Informational_Duration    12330 non-null  float64
 4   ProductRelated            12330 non-null  int64
 5   ProductRelated_Duration   12330 non-null  float64
 6   BounceRates               12330 non-null  float64
 7   ExitRates                 12330 non-null  float64
 8   PageValues                12330 non-null  float64
 9   SpecialDay                12330 non-null  float64
 10  Month                     12330 non-null  object
 11  OperatingSystems          12330 non-null  int64
 12  Browser                   12330 non-null  int64
 13  Region                    12330 non-null  int64
 14  TrafficType               12330 non-null  int64
 15  VisitorType               12330 non-null  object
 16  Weekend                   12330 non-null  bool
 17  Revenue                   12330 non-null  bool
dtypes: bool(2), float64(7), int64(7), object(2)
memory usage: 1.5+ MB
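The summary above comes from loading the CSV into pandas and calling .info(); here is a minimal sketch, assuming the file is saved locally as online_shoppers_intention.csv (adjust the filename to your own UCI download or analytics export):

import pandas as pd

# Load the session data (filename assumed; point this at your own
# UCI download or Google Analytics export).
sessions = pd.read_csv('online_shoppers_intention.csv')
sessions.info()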
2. Import libraries and pre-process data
We use a number of libraries, not only for our models but also to process our data and optimize our model. In this case we will use K-Nearest Neighbors as our model, a supervised algorithm that is well suited to classification problems and relatively fast to train:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.pipeline import Pipeline
from sklearn.compose import make_column_transformer
from sklearn.metrics import (accuracy_score, recall_score, precision_score,
                             precision_recall_curve, roc_curve,
                             RocCurveDisplay, confusion_matrix)
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn import set_config

set_config(display='diagram')  # render pipelines as diagrams in the notebook
Our data is already fairly clean, with no missing values or strange strings to clean up, however we want all our data to be in numerical format AND scaled for our model to work optimally. To handle this in the pipeline we will build later, we create a simple column transformer as follows:
transformer = make_column_transformer(
    (OneHotEncoder(drop='if_binary'), ['Month', 'VisitorType', 'Weekend']),
    remainder=StandardScaler()
)
Note how we have hardcoded ‘Month’, ‘VisitorType’ and ‘Weekend’ as the columns we want to one hot encode (convert from text values to numerical values). For each of these columns, new columns will be created, one for each value within it, and the original column will be dropped; all other numerical columns will be scaled. For example, Month will be turned into 12 columns containing a 1 or a 0 depending on the row’s month value. We could also have added some code to select columns of type object rather than hardcode the column names (see the sketch below), but for so little typing it wasn’t worth it.
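For reference, here is a minimal sketch of that programmatic alternative using scikit-learn’s make_column_selector to pick up the categorical columns by dtype (object and bool in this dataset) instead of listing them by hand:

from sklearn.compose import make_column_selector, make_column_transformer

# Select the object and bool columns (Month, VisitorType, Weekend) by dtype
# rather than hardcoding their names.
transformer = make_column_transformer(
    (OneHotEncoder(drop='if_binary'),
     make_column_selector(dtype_include=['object', 'bool'])),
    remainder=StandardScaler()
)

Note that this selector will also pick up any other text or boolean columns you later add to the export, which may or may not be what you want; the hardcoded list is more explicit.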
Finally before running anything we want to split our data for training and testing:
X_train, X_test, y_train, y_test = train_test_split(
    sessions.drop('Revenue', axis=1),
    sessions.Revenue,
    random_state=42,
    stratify=sessions.Revenue
)
We use the default split, which keeps 75% of our data for training, and pass stratify so that the proportion of buyers to non-buyers is the same in the training and test sets (i.e. both reflect the real world).
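If you want to confirm the stratified split behaved as expected, here is a quick optional sanity check using the variables defined above:

# The share of buyers (Revenue == True) should be nearly identical
# in the training and test sets thanks to stratify.
print(y_train.value_counts(normalize=True))
print(y_test.value_counts(normalize=True))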
3. Set up a pipeline with our model, then train & test
Using pipelines from scikit-learn is a good idea (see my other post on this) as they are cleaner to change and easier to optimize without getting messy. In our case I set up a pipeline with the transformer we built earlier followed by the K-Nearest Neighbors model; we can then fit the model on our training data. Now we are able to run some test predictions with our test data:
knn_pipe = Pipeline([
    ('transform', transformer),
    ('knn', KNeighborsClassifier(n_neighbors=10))
])
knn_pipe  # displays the pipeline diagram in the notebook

knn_pipe.fit(X_train, y_train)
test_predictions = knn_pipe.predict_proba(X_test)
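Before tuning anything, it is worth getting a baseline score; here is a sketch of how you might evaluate the un-tuned pipeline on the held-out test set (using predict for class labels rather than predict_proba):

# Baseline evaluation of the un-tuned pipeline on the test set.
y_pred = knn_pipe.predict(X_test)
print('accuracy :', accuracy_score(y_test, y_pred))
print('precision:', precision_score(y_test, y_pred))
print('recall   :', recall_score(y_test, y_pred))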
4. Tuning our model for desired results
Using our test data to make test predictions lets us evaluate our current model’s accuracy, but we will dive a little deeper and tune hyperparameters to find the best setup for our task. The two main parameters we can adjust in our KNN model are the decision threshold and K. As shown above, we set n_neighbors = 10 as a starting point. Put simply, K (i.e. n_neighbors) is the number of nearest neighbors the algorithm looks at to decide how to classify a given datapoint. By default, if 50% or fewer of those K neighbors are buyers the model predicts 0; above that, it predicts, you guessed it, 1. Something very useful is that we can also extract the probability of each datapoint belonging to one class or the other using predict_proba. By changing the threshold we can essentially tell our model what probability is required before it labels a point as a 0 or a 1 (i.e. False or True); we will cover why this might be needed later in this post. So now we have two settings we can play with: K, the number of neighbors used to classify, and the decision threshold used to turn probabilities into labels.
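As a minimal sketch of what a custom threshold looks like in code, using the pipeline fitted above (the 0.4 value is purely illustrative; how to choose it is discussed further down):

# Probability of the positive class (Revenue == True) for each test session.
proba_purchase = knn_pipe.predict_proba(X_test)[:, 1]

# Apply a custom decision threshold instead of the default 0.5.
threshold = 0.4  # illustrative value, see the discussion below
custom_predictions = proba_purchase >= threshold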
To do this programmatically we don’t need to change inputs and re-run models hundreds of times. We use grid search cross-validation, GridSearchCV(), to run our pipeline over a dictionary of hyperparameters we want to test, score each candidate (in this case with ‘roc_auc’) and return the best parameters:
roc_grid = GridSearchCV(
    knn_pipe,
    param_grid={'knn__n_neighbors': range(1, 63, 2)},
    scoring='roc_auc'
)
roc_grid.fit(X_train, y_train)
best_k = roc_grid.best_params_['knn__n_neighbors']
A full explanation of ROC AUC is beyond the scope of this post; it should suffice to say we are trying to maximize the area under our ROC curve: the larger the area, the better our model. Below is a ROC curve showing how changing the decision threshold affects our predictions:
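If you want to reproduce the curve yourself, here is a sketch using scikit-learn’s RocCurveDisplay and the fitted grid search from above:

# Plot the ROC curve of the best estimator found by the grid search.
RocCurveDisplay.from_estimator(roc_grid.best_estimator_, X_test, y_test)
plt.show()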
We can see that K was selected as 47.
So why would we want to change our decision threshold?
When making a prediction for a two-class classification problem, there are two types of errors the model can make:
False Positive. The model predicts a purchase when there was no purchase (e.g. we flag a visitor as a buyer, but they do not buy).
False Negative. The model predicts no purchase when in fact there was one (e.g. we miss a visitor who does buy).
By calibrating the threshold based on precision and recall (aka sensitivity), we can choose a balance between these two concerns that limits the risk of bad predictions from our model.
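One way to see how many of each error type the tuned model currently makes is the confusion matrix, sketched here with the grid search from above:

# Rows are the true labels, columns the predicted labels:
# [[true negatives,  false positives],
#  [false negatives, true positives ]]
print(confusion_matrix(y_test, roc_grid.predict(X_test)))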
Choosing our decision threshold: in our case there is no catastrophic outcome if we wrongly predict a false positive (i.e. we predict a person will make a purchase but they don’t). Since we use this data to trigger sales and marketing activities, the better tradeoff to capture all potential customers is to reduce the number of false negatives, even if that means accepting a few more false positives. We do this by selecting a threshold on the lower end, say 0.4; essentially low precision and high recall, or in other words an optimistic classifier. On the other hand, if we had limited product and a very costly sales follow-up process we would want the inverse: high precision and low recall.
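Here is a sketch of how you could inspect the precision/recall tradeoff across thresholds before committing to a value like 0.4 (the cut-off itself remains a business decision, not something the code derives):

# Precision and recall for every candidate threshold on the test set.
proba = roc_grid.predict_proba(X_test)[:, 1]
precision, recall, thresholds = precision_recall_curve(y_test, proba)

# precision and recall have one more entry than thresholds, so drop the last.
plt.plot(thresholds, precision[:-1], label='precision')
plt.plot(thresholds, recall[:-1], label='recall')
plt.xlabel('decision threshold')
plt.legend()
plt.show()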
5. Put into Production
Now that we have a custom model trained on our specific data and use case, we can use it with the selected parameters via knn_pipe.predict(X) (or the tuned roc_grid) to classify new site visits. This could be as simple as downloading analytics and loading the data locally into our Jupyter notebook on a weekly or bi-weekly basis to tag users with high intention. We could also create a simple Streamlit application so anyone can upload our site’s analytics data and get better data visualization, or set up a more advanced integration for realtime prediction using FastAPI connected to your data sources. I am planning some posts to show how to do this. Please don’t hesitate to reach out with any questions or for help with your own implementations.
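To close, here is a minimal sketch of the simplest batch workflow described above; the filename new_sessions.csv and the 0.4 threshold are assumptions for illustration, and the export must contain the same columns the model was trained on (everything except Revenue):

# Load a fresh analytics export with the same feature columns used in training.
new_sessions = pd.read_csv('new_sessions.csv')  # hypothetical weekly export

# Score each session with the tuned model and flag high-intention visitors.
purchase_proba = roc_grid.predict_proba(new_sessions)[:, 1]
new_sessions['high_intention'] = purchase_proba >= 0.4  # assumed threshold

new_sessions.to_csv('tagged_sessions.csv', index=False)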