How I built a simple AI model to convert more sales on Shopify
Have you ever thought about leveraging your existing e-commerce data to identify visitors with a high intention to purchase on your online store?
Doing so is relatively easy and will enable you to better prioritize your marketing efforts towards visitors likely to convert, producing a superior return on marketing investment and, put simply, growing your sales. In this post I highlight the steps I implemented to get 91% accuracy using basic site logs, Google Analytics data, Python and scikit-learn models. This can be implemented in less than a day; just reach out to me directly if you prefer to have someone do it for you.
We will run through the following steps to set this up:
Data collection and cleaning
Import libraries and pre-process data
Set up a pipeline with our model, then train & test
Final model selection and tuning with hyperparameters
Put into production
To step through this process I used analytics data from a UCI Machine Learning repository dataset rather than expose sensitive data from my clients.
You can access the notebook used in this demo on github: https://github.com/sjt80/ecommerce-customer-intention
1. Collect data
Regardless of what you are selling, you will need to track your site visitors and their session activity (categories/sections viewed, pages viewed, time spent on each, whether they are a new or returning visitor, etc.; the more data the better) and, finally but most importantly, record whether they made a purchase during their session (in our case our target is “Revenue”, a simple true or false value). Sourcing the data can easily be done with Google Analytics by pasting a line of code into your site’s header. GA is free and the industry norm, and there are excellent tutorials on how to set it up in under an hour.
The dataset we are using looks like this:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12330 entries, 0 to 12329
Data columns (total 18 columns):
 #   Column                    Non-Null Count  Dtype
---  ------                    --------------  -----
 0   Administrative            12330 non-null  int64
 1   Administrative_Duration   12330 non-null  float64
 2   Informational             12330 non-null  int64
 3   Informational_Duration    12330 non-null  float64
 4   ProductRelated            12330 non-null  int64
 5   ProductRelated_Duration   12330 non-null  float64
 6   BounceRates               12330 non-null  float64
 7   ExitRates                 12330 non-null  float64
 8   PageValues                12330 non-null  float64
 9   SpecialDay                12330 non-null  float64
 10  Month                     12330 non-null  object
 11  OperatingSystems          12330 non-null  int64
 12  Browser                   12330 non-null  int64
 13  Region                    12330 non-null  int64
 14  TrafficType               12330 non-null  int64
 15  VisitorType               12330 non-null  object
 16  Weekend                   12330 non-null  bool
 17  Revenue                   12330 non-null  bool
dtypes: bool(2), float64(7), int64(7), object(2)
memory usage: 1.5+ MB
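The summary above comes from loading the CSV into pandas and calling .info(); here is a minimal sketch, assuming the file is saved locally as online_shoppers_intention.csv (adjust the filename to your own UCI download or analytics export):

import pandas as pd

# Load the session data (filename assumed; point this at your own
# UCI download or Google Analytics export).
sessions = pd.read_csv('online_shoppers_intention.csv')
sessions.info()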
2. Import libraries and pre-process data
We use a number of libraries, not only for our models but also to process our data and optimize our model. In this case we will use K-Nearest Neighbors as our model, a supervised algorithm that is well suited to classification problems and relatively fast to train:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.pipeline import Pipeline
from sklearn.compose import make_column_transformer
from sklearn.metrics import (accuracy_score, recall_score, precision_score,
                             precision_recall_curve, roc_curve,
                             RocCurveDisplay, confusion_matrix)
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn import set_config

set_config(display='diagram')  # render pipelines as diagrams in the notebook
Our data is already fairly clean, with no missing values or strange strings to clean up, however we want all our data to be in numerical format AND scaled for our model to work optimally. To handle this in the pipeline we will build later, we create a simple column transformer as follows:
transformer = make_column_transformer(
    (OneHotEncoder(drop='if_binary'), ['Month', 'VisitorType', 'Weekend']),
    remainder=StandardScaler()
)
Note how we have hardcoded ‘Month’, ‘VisitorType’ and ‘Weekend’ as the columns we want to one hot encode (convert from text values to numerical values). For each of these columns, new columns will be created, one for each value within it, and the original column will be dropped; all other numerical columns will be scaled. For example, Month will be turned into 12 columns containing a 1 or a 0 depending on the row’s month value. We could also have added some code to select columns of type object rather than hardcode the column names (see the sketch below), but for so little typing it wasn’t worth it.
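For reference, here is a minimal sketch of that programmatic alternative using scikit-learn’s make_column_selector to pick up the categorical columns by dtype (object and bool in this dataset) instead of listing them by hand:

from sklearn.compose import make_column_selector, make_column_transformer

# Select the object and bool columns (Month, VisitorType, Weekend) by dtype
# rather than hardcoding their names.
transformer = make_column_transformer(
    (OneHotEncoder(drop='if_binary'),
     make_column_selector(dtype_include=['object', 'bool'])),
    remainder=StandardScaler()
)

Note that this selector will also pick up any other text or boolean columns you later add to the export, which may or may not be what you want; the hardcoded list is more explicit.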
Finally before running anything we want to split our data for training and testing:
X_train, X_test, y_train, y_test = train_test_split(
    sessions.drop('Revenue', axis=1),
    sessions.Revenue,
    random_state=42,
    stratify=sessions.Revenue
)
We use the default split, which keeps 75% of our data for training, and pass stratify so that the proportion of buyers to non-buyers is the same in the training and test sets (i.e. both reflect the real world).
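If you want to confirm the stratified split behaved as expected, here is a quick optional sanity check using the variables defined above:

# The share of buyers (Revenue == True) should be nearly identical
# in the training and test sets thanks to stratify.
print(y_train.value_counts(normalize=True))
print(y_test.value_counts(normalize=True))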
3. Set up a pipeline with our model, then train & test
Using pipelines from scikit-learn is a good idea (see my other post on this) as they are cleaner to change and easier to optimize without getting messy. In our case I set up a pipeline with the transformer we built earlier followed by the K-Nearest Neighbors model; we can then fit the model on our training data. Now we are able to run some test predictions with our test data:
knn_pipe = Pipeline([
    ('transform', transformer),
    ('knn', KNeighborsClassifier(n_neighbors=10))
])
knn_pipe  # displays the pipeline diagram in the notebook

knn_pipe.fit(X_train, y_train)
test_predictions = knn_pipe.predict_proba(X_test)
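Before tuning anything, it is worth getting a baseline score; here is a sketch of how you might evaluate the un-tuned pipeline on the held-out test set (using predict for class labels rather than predict_proba):

# Baseline evaluation of the un-tuned pipeline on the test set.
y_pred = knn_pipe.predict(X_test)
print('accuracy :', accuracy_score(y_test, y_pred))
print('precision:', precision_score(y_test, y_pred))
print('recall   :', recall_score(y_test, y_pred))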
4. Tuning our model for desired results
Using our test data to make test predictions lets us evaluate our current model’s accuracy, but we will dive a little deeper and tune hyperparameters to find the best setup for our task. The two main parameters we can adjust in our KNN model are the decision threshold and K. As shown above, we set n_neighbors = 10 as a starting point. Put simply, K (i.e. n_neighbors) is the number of nearest neighbors the algorithm looks at to decide how to classify a given datapoint. By default, if 50% or fewer of those K neighbors are buyers the model predicts 0; above that, it predicts, you guessed it, 1. Something very useful is that we can also extract the probability of each datapoint belonging to one class or the other using predict_proba. By changing the threshold we can essentially tell our model what probability is required before it labels a point as a 0 or a 1 (i.e. False or True); we will cover why this might be needed later in this post. So now we have two settings we can play with: K, the number of neighbors used to classify, and the decision threshold used to turn probabilities into labels.
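As a minimal sketch of what a custom threshold looks like in code, using the pipeline fitted above (the 0.4 value is purely illustrative; how to choose it is discussed further down):

# Probability of the positive class (Revenue == True) for each test session.
proba_purchase = knn_pipe.predict_proba(X_test)[:, 1]

# Apply a custom decision threshold instead of the default 0.5.
threshold = 0.4  # illustrative value, see the discussion below
custom_predictions = proba_purchase >= threshold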
To do this programmatically we don’t need to change inputs and re-run models hundreds of times. We use grid search cross-validation, GridSearchCV(), to run our pipeline over a dictionary of hyperparameters we want to test, score each candidate (in this case with ‘roc_auc’) and return the best parameters:
roc_grid = GridSearchCV(
    knn_pipe,
    param_grid={'knn__n_neighbors': range(1, 63, 2)},
    scoring='roc_auc'
)
roc_grid.fit(X_train, y_train)
best_k = roc_grid.best_params_['knn__n_neighbors']
A full explanation of ROC AUC is beyond the scope of this post; it should suffice to say we are trying to maximize the area under our ROC curve: the larger the area, the better our model. Below is a ROC curve showing how changing the decision threshold affects our predictions:
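If you want to reproduce the curve yourself, here is a sketch using scikit-learn’s RocCurveDisplay and the fitted grid search from above:

# Plot the ROC curve of the best estimator found by the grid search.
RocCurveDisplay.from_estimator(roc_grid.best_estimator_, X_test, y_test)
plt.show()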
We can see that K was selected as 47.
So why would we want to change our decision threshold?
When making a prediction for a two-class classification problem, there are two types of errors the model can make:
False Positive. The model predicts a purchase when there was no purchase (e.g. we flag a visitor as a buyer, but they do not buy).
False Negative. The model predicts no purchase when in fact there was one (e.g. we miss a visitor who does buy).
By calibrating the threshold based on precision and recall (aka sensitivity), we can choose a balance between these two concerns that limits the risk of bad predictions from our model.
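One way to see how many of each error type the tuned model currently makes is the confusion matrix, sketched here with the grid search from above:

# Rows are the true labels, columns the predicted labels:
# [[true negatives,  false positives],
#  [false negatives, true positives ]]
print(confusion_matrix(y_test, roc_grid.predict(X_test)))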
Choosing our decision threshold: in our case there is no catastrophic outcome if we wrongly predict a false positive (i.e. we predict a person will make a purchase but they don’t). Since we use this data to trigger sales and marketing activities, the better tradeoff to capture all potential customers is to reduce the number of false negatives, even if that means accepting a few more false positives. We do this by selecting a threshold on the lower end, say 0.4; essentially low precision and high recall, or in other words an optimistic classifier. On the other hand, if we had limited product and a very costly sales follow-up process we would want the inverse: high precision and low recall.
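Here is a sketch of how you could inspect the precision/recall tradeoff across thresholds before committing to a value like 0.4 (the cut-off itself remains a business decision, not something the code derives):

# Precision and recall for every candidate threshold on the test set.
proba = roc_grid.predict_proba(X_test)[:, 1]
precision, recall, thresholds = precision_recall_curve(y_test, proba)

# precision and recall have one more entry than thresholds, so drop the last.
plt.plot(thresholds, precision[:-1], label='precision')
plt.plot(thresholds, recall[:-1], label='recall')
plt.xlabel('decision threshold')
plt.legend()
plt.show()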
5. Put into Production
Now that we have a custom model trained on our specific data and use case, we can use it with the selected parameters via knn_pipe.predict(X) (or the tuned roc_grid) to classify new site visits. This could be as simple as downloading analytics and loading the data locally into our Jupyter notebook on a weekly or bi-weekly basis to tag users with high intention. We could also create a simple Streamlit application so anyone can upload our site’s analytics data and get better data visualization, or set up a more advanced integration for realtime prediction using FastAPI connected to your data sources. I am planning some posts to show how to do this. Please don’t hesitate to reach out with any questions or for help with your own implementations.
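To close, here is a minimal sketch of the simplest batch workflow described above; the filename new_sessions.csv and the 0.4 threshold are assumptions for illustration, and the export must contain the same columns the model was trained on (everything except Revenue):

# Load a fresh analytics export with the same feature columns used in training.
new_sessions = pd.read_csv('new_sessions.csv')  # hypothetical weekly export

# Score each session with the tuned model and flag high-intention visitors.
purchase_proba = roc_grid.predict_proba(new_sessions)[:, 1]
new_sessions['high_intention'] = purchase_proba >= 0.4  # assumed threshold

new_sessions.to_csv('tagged_sessions.csv', index=False)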