LeafSim

An Example-Based XAI for Decision Tree Based Ensemble Methods

Author

Lucas Chizzali

Published

December 16, 2022

1 Summary

This blog post presents LeafSim, an example-based explainable AI (XAI) technique for decision-tree-based ensemble methods.
The technique applies the Hamming distance to leaf indices to measure the similarity between test and training instances.
It thereby explains model predictions by identifying the training data points that most influenced a given prediction.

The proposed technique is:

  • easy for non-technical audiences to interpret
  • complementary to existing XAI techniques
  • straightforward to implement & maintain in production
  • computationally lightweight

2 Introduction

This blog post explores example-based XAI techniques, introducing an approach that aims to complement well-established techniques such as SHAP and LIME.
Specifically, the approach explains individual predictions of decision-tree-based ensemble methods, such as CatBoost, regardless of whether they are regressors or classifiers.

Existing XAI approaches, such as SHAP, provide insights into the most relevant features, both from a global and a local perspective. They can also contextualize the predictions of individual data points, for example by measuring how predictions change as feature values vary.

However, for people with limited knowledge of Machine Learning, such information can be hard to parse and to turn into an understanding of what underpins model predictions.

In such cases, providing example-based explanations can be very helpful, as we outline in this blog post.

3 Coding Environment

Code
%load_ext jupyter_black
%load_ext autoreload
%autoreload 2
Code
# Import required libraries
from leafsim import LeafSim
import matplotlib.pyplot as plt
import matplotlib.ticker as ticker
import pandas as pd
import numpy as np
from IPython.display import Image
import seaborn as sns
from sklearn.model_selection import train_test_split
import shap
from catboost import CatBoostRegressor

# Helper functions
from utils import get_similarity_table, get_similarity_plots
from IPython.display import display
Code
# https://github.com/catboost/catboost/issues/2179
import warnings

warnings.simplefilter(action="ignore", category=FutureWarning)
Code
# Define settings
plt.rcParams["figure.dpi"] = 400
font = {"family": "serif", "weight": "normal"}
plt.rc("font", **font)
plt.rc("xtick", labelsize=16)
plt.rc("ytick", labelsize=16)
plt.rc("axes", labelsize=20)
plt.rc("figure", titlesize=22)
plt.rc("legend", fontsize=14)
np.random.seed(46)
shap.initjs()

4 Data

A dataset on used cars has been selected to analyze the proposed approach and provide examples. It contains information such as mileage and selling price on roughly 100,000 used cars from various popular brands and is available on Kaggle (as of July 2022).

Code
def load_df(filepath):
    df = pd.read_csv(filepath)
    # Clean column names
    df.columns = list(
        map(lambda x: x.strip().lower().replace(" ", "").replace("(£)", ""), df.columns)
    )
    # Add brand (taken from filename)
    df["brand"] = filepath.split("/")[-1].replace(".csv", "")
    return df
Code
# Manually specify files instead of using
# glob.glob("../data/[!unclean]*.csv")
# so that files are read in the same order across systems.
# Using sorted() would be cleaner, but changing the order now
# would alter the results produced so far in this notebook.
files = [
    "data/vauxhall.csv",
    "data/bmw.csv",
    "data/vw.csv",
    "data/hyundi.csv",
    "data/toyota.csv",
    "data/ford.csv",
    "data/focus.csv",
    "data/skoda.csv",
    "data/merc.csv",
]
df = pd.concat(
    [
        # Files whose names start with "unclean" are already
        # excluded from the manual list above
        load_df(i)
        for i in files
    ]
)
df.reset_index(drop=True, inplace=True)
Code
(df.dtypes.to_frame().reset_index().rename(columns={"index": "column", 0: "data_type"}))
column data_type
0 model object
1 year int64
2 price int64
3 transmission object
4 mileage int64
5 fueltype object
6 tax float64
7 mpg float64
8 enginesize float64
9 brand object

Predictors such as tax and miles per gallon were disregarded to simplify the analysis.

Code
# Define features
feature_cols = [
    "brand",
    "model",
    "year",
    "mileage",
    "transmission",
    "fueltype",
    "enginesize",
]
# of which the following ones are of categorical type:
categorical_feature_cols = ["brand", "model", "fueltype", "transmission"]

numeric_cols = [i for i in feature_cols if i not in categorical_feature_cols]

# Define the target of the model, i.e. what we want to predict
target_col = "price"

4.1 Clean data

The first and only cleaning step involves removing entries with unrealistic attributes (cars built in the future and those with an engine capacity of zero liters).

Code
df = (
    df.loc[lambda x: (x.enginesize > 0) & (x.year <= 2022)]
    .copy()
    .reset_index(drop=True)
)
Code
# Create an ID so cars can be referenced precisely even if the random seed behaves differently across systems
df["ID"] = df.index.tolist()

With this, let us obtain high-level summaries of the considered attributes.

Code
fig, axes = plt.subplots(4, (len(feature_cols) + 1) // 4, figsize=(20, 20))
plt.suptitle("Car attributes considered", fontsize=20)
axes = axes.flatten()
for i, c in enumerate(feature_cols + [target_col]):
    if c in categorical_feature_cols:
        if df[c].nunique() > 15:
            tmp = df[c].value_counts(normalize=True).reset_index()
            tmp.loc[15:, c] = "Other"
            tmp = tmp.groupby(by=c).sum().sort_values(by="proportion", ascending=False)
            tmp.plot(kind="bar", ax=axes[i], color="darkviolet", alpha=0.7)
            axes[i].set_xlabel("Top 15 + all remaining in 'Other'")
        else:
            df[c].value_counts(normalize=True).plot(
                kind="bar", ax=axes[i], color="darkviolet", alpha=0.7
            )
    else:
        sns.histplot(
            data=df, x=c, ax=axes[i], stat="probability", color="darkviolet", alpha=0.7
        )
        axes[i].axvline(
            df[c].mean(),
            label="Average",
            linewidth=1,
            linestyle="--",
            color="royalblue",
        )
        axes[i].set_xlabel("")
        axes[i].legend()
    axes[i].set_title(c.title(), fontsize=22)
    axes[i].set_ylabel("Share")
    axes[i].yaxis.set_major_formatter(
        ticker.FuncFormatter(lambda x, pos: f"{x*100:,.0f}%")
    )
    if c not in ["year", "model"]:
        axes[i].xaxis.set_major_formatter(
            ticker.FuncFormatter(lambda x, pos: f"{x:,.0f}")
        )
    if c == "price":
        axes[i].xaxis.set_major_formatter(
            ticker.FuncFormatter(lambda x, pos: f"£{x:,.0f}")
        )
        axes[i].tick_params(axis="x", labelrotation=45)
plt.tight_layout()
plt.savefig("images/eda_plot.png", dpi=600)

A few interesting insights stand out from these histograms:

  • Only a tiny fraction of cars are EVs or hybrids.
  • Most of the cars are sold at prices close to \(£16,000\).
  • Ford cars are the most popular, especially considering that the "focus" brand almost certainly refers to a Ford model, namely the Ford Focus.

While a more thorough analysis may reveal more, data exploration and preparation are outside the scope of this blog post and are not pursued further.

5 Task

With a grasp of the data, let us imagine ourselves in the shoes of a car dealer who must decide, based on a car's relevant properties, at what price to sell it.

To support car dealers, we will provide them with an automated tool that suggests an indicative, initial price point based on the prices of historically sold cars.

Such a tool is assumed to mimic how car dealers naturally set prices: by looking at the prices that similar cars fetched.

6 Model

To solve the defined task, a CatBoost regressor is chosen. However, any decision-tree ensemble method (regressor or classifier) could be used.

For modeling, near-default hyperparameters are used (only slightly modifying the number of estimators and the leaf size), and the data is randomly split 80-20 into training and test sets.

Code
# Initiate model
model = CatBoostRegressor(random_seed=46, n_estimators=50, min_data_in_leaf=4)

# Define name of prediction column
predicted_col = "predicted_" + target_col

# Set categorical features to categorical dtype
df[categorical_feature_cols] = (
    df[categorical_feature_cols].fillna("None").astype("category")
)

# Define splits
train_idx, test_idx = train_test_split(
    np.arange(len(df)), test_size=0.2, random_state=46
)

# Manually modify the splits because there are 2 cars
# we would like to have in our test set,
# as these will be used as examples throughout the notebook.
# In case the random seed has a different effect across
# platforms and machines, this ensures that at least
# these 2 examples are the same.
car_ids_in_test = [20844, 13284]
for car_id in car_ids_in_test:
    if car_id not in test_idx:
        print(f"Modifying train/test split to include car ID {car_id}")
        test_idx = np.append(test_idx, car_id)
        car_idx_train = np.where(train_idx == car_id)[0][0]
        train_idx = np.delete(train_idx, car_idx_train)
assert all([i in test_idx and i not in train_idx for i in car_ids_in_test])

# Create data splits
df_train = df.loc[train_idx].reset_index(drop=True).copy()
df_test = df.loc[test_idx].reset_index(drop=True).copy()
Code
# Train model
model.fit(
    df_train[feature_cols],
    df_train[target_col],
    cat_features=categorical_feature_cols,
    verbose=False,
);
Code
# Make predictions on train and test sets and compute errors
for ds in [df_train, df_test]:
    ds[predicted_col] = model.predict(ds[feature_cols]).astype(int)
    # Evaluate model
    diff = ds[predicted_col] - ds[target_col]
    ds["absolute_percentage_error"] = diff.abs() / ds[target_col]
    ds["relative_percentage_error"] = diff / ds[target_col]
Code
(
    pd.concat((df_train.assign(split="train"), df_test.assign(split="test")))
    .groupby("split")
    .agg(
        mepe=("relative_percentage_error", "mean"),
        mape=("absolute_percentage_error", "mean"),
    )
    .rename(
        columns={
            "mepe": "Mean Relative Percentage Error",
            "mape": "Mean Absolute Percentage Error",
        }
    )
    .applymap(lambda x: "{:.1f}%".format(x * 100))
)
Mean Relative Percentage Error Mean Absolute Percentage Error
split
test 1.3% 9.4%
train 1.2% 9.4%

The results above suggest the model performs reasonably well as, on average, it is only off by about 9% and tends to slightly over-predict the actual price (ca. 1% higher on average).

Again, the goal here is not to develop the best performing model, but to ensure that it accomplishes the task with reasonable accuracy and that no overfitting occurs.

Now that the model is trained, we can turn to state-of-the-art XAI techniques to make this black box more transparent.

7 SHAP

Code
model_explainer = shap.TreeExplainer(model)
shap_values = model_explainer(df_train[feature_cols])
shap_values_test = model_explainer(df_test[feature_cols])

LIME and SHAP are popular tools that are, in many respects, very similar. However, as the objective here is to illustrate their common shortcomings, we will only focus on SHAP.

SHAP can be used to obtain global insights into which features are most important and how they affect the predictions of a model. A concrete example is ranking features by their average absolute contribution to the model predictions.

These contributions are often visualized using the built-in bar plot provided by SHAP.

Code
shap.plots.bar(shap_values)

Here, we can see that the engine size is by far the most relevant attribute when it comes to predicting car sale prices (with an average absolute impact of \(\approx £ 3,100\)).

Similarly, the beeswarm plot is useful to gauge not only the impact but also the sign of the contribution.

Code
shap.plots.beeswarm(shap_values)

We can see that larger engine sizes (higher feature values, coloured in red) are related to higher predictions of sales prices (larger SHAP value).

However, SHAP can also be used to obtain local explanations, i.e. explain the prediction of a particular instance, in our case a specific car we want to sell.

Consider the following car, for which we would like to determine the sales price:

Code
# Pick a specific car with nice properties
# for the sake of explaining how SHAP works
car_to_explain_id = 20844
example = df_test.query(f"ID == {car_to_explain_id}")[feature_cols]
car_to_explain_idx = example.index[0]
display(example)
brand model year mileage transmission fueltype enginesize
872 bmw M4 2016 32866 Semi-Auto Petrol 3.0

To get some insight into what affected the decision of the model, we can use the popular waterfall plot:

Code
shap.plots.waterfall(shap_values_test[car_to_explain_idx])

The plot shows the influence of selected features on the model prediction \(f(x)\) for this specific car compared to the average prediction of all the cars in the training set \(E[f(X)]\). Concretely, the model expects the average car to sell for \(E[f(X)] \approx £16,000\), while it predicts the selected car to sell for \(f(x) \approx £31,000\).
A major reason the model estimates this car's price to be higher than average is its \(3\)-litre engine, which translates into an increase of \(\approx £12,500\) above the mean expected sales price.

For car dealers, however, the above explanations might not be very intuitive. An explanation that references the average car is of limited use, since it is unclear what the average car looks like without some insight into the dataset (see the histograms in the Data section).
Instead, it might be more relevant to understand how this prediction compares to similar cars sold in the past.

To address this, we propose an approach to find similar training instances, i.e., those on which the model mainly bases its prediction. This way, car dealers can directly gauge the observed sales prices of similar cars, allowing them to put those into perspective and understand from where a predicted price stems.

To this end, we introduce our example-based XAI technique named LeafSim.

8 LeafSim

The proposed approach works only for decision trees and their ensemble derivatives, such as the popular CatBoost algorithm.
It leverages how such models are built to easily extract similar training instances for a given prediction we wish to explain.

More specifically, each decision tree partitions the feature space through a series of binary questions.
Each question splits the data into two branches; a node that is not followed by another question is colloquially known as a leaf.
Where the tree stops splitting is governed by a chosen optimisation criterion, such as RMSE for regression.
Further, leaves provide predictions based on the training observations ending up in them.
For regression, for example, these predictions are the average target value of those observations.
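
To make leaf indices concrete, here is a minimal sketch using scikit-learn's DecisionTreeRegressor on toy data (purely illustrative and not part of the analysis; any small regression tree would do):

Code
# Illustrative only: a tiny regression tree on synthetic data
from sklearn.tree import DecisionTreeRegressor

X_toy = np.arange(20).reshape(-1, 1)
y_toy = X_toy.ravel() + np.random.normal(scale=0.5, size=20)

toy_tree = DecisionTreeRegressor(max_depth=2).fit(X_toy, y_toy)

# .apply returns the index of the leaf each observation falls into;
# observations sharing a leaf receive the same prediction, namely the
# mean target value of the training points in that leaf
print(toy_tree.apply(X_toy))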

This entire process is succinctly summarised in the following image from the Encyclopedia of Machine Learning, which visualises the case of regression using two predictor variables.

Code
# Use Image to display so it can be rendered in PDF when running
# jupyter nbconvert --to webpdf --no-input notebook.ipynb
# As opposed to:
# <div>
# <img src="images/decision_tree_regression.jpg" width="600"/>
# </div>
Image(filename="images/decision_tree_regression.jpg", width=600)

Here, the square boxes refer to the leaves; their associated prediction value is simply the average \(Y\) value of the observations falling into them.

Based on this understanding of how decision trees are built and how they operate, it becomes evident that two data points ending up in the same leaf have very similar features (they satisfy the same conditions imposed by the binary questions along the path to that leaf) and similar targets (as enforced by the leaf's optimisation criterion).

Thus, we define similar instances as those that end up most often in the same leaf across all decision trees in an ensemble. Hence the name LeafSim. This approach is loosely related to [1], which introduced LeafRefit to rank training instances based on the change in loss for a given test instance.

Visually, LeafSim can be represented as follows:

Code
Image(filename="images/leaf_index.png", width=1200)

Here, \(x_{test}\) is the instance for which we want to explain the prediction and \(X_{train}\) are all the instances in the training data, with \(x_i\) being an individual observation (in our case a car).

Similarity can thus be measured using the Hamming distance.
Briefly, the Hamming distance captures, for two strings of equal length, the number of positions at which the symbols differ.
This allows us to count in how many trees the leaf indices of two observations differ.
Dividing this count by the number of trees and subtracting the result from \(1\) turns it into a similarity score bounded between \(0\) and \(1\).
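
In code, this boils down to counting matching leaf indices across trees. Below is a minimal sketch (a hypothetical helper, not the library's implementation), assuming the leaf indices are available as integer arrays of shape (n_observations, n_trees):

Code
def leafsim_scores(train_leaf_idx, test_leaf_idx_row):
    """Similarity of one test observation to every training observation.

    train_leaf_idx: array of shape (n_train, n_trees) with training leaf indices
    test_leaf_idx_row: array of shape (n_trees,) for a single test observation
    """
    # Hamming distance: in how many trees do the leaf indices differ?
    hamming = (train_leaf_idx != test_leaf_idx_row).sum(axis=1)
    # Normalise by the number of trees and invert to obtain a score in [0, 1]
    return 1 - hamming / train_leaf_idx.shape[1]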

9 Implementation

In the case of CatBoost, LeafSim can leverage the built-in calc_leaf_indexes method to obtain the leaf indices; for XGBoost, the equivalent is apply.
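
For the CatBoost model trained above, extracting the leaf indices could look roughly as follows (a sketch; calc_leaf_indexes returns one leaf index per observation and tree):

Code
from catboost import Pool

# One leaf index per (observation, tree) pair; shape (n_observations, n_trees)
train_pool = Pool(df_train[feature_cols], cat_features=categorical_feature_cols)
train_leaf_idx = model.calc_leaf_indexes(train_pool)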

For those who wish to learn more about the implementation of LeafSim and access the code used in this blog post, please refer to the Renku repository.

Code
# Since this notebook is available on a hosting platform
# and can be executed there, let's reduce the computational complexity
# by only finding similar training observations for some cars in the test set
df_test_subset = pd.concat(
    (
        df_test.loc[lambda x: ~x.ID.isin(car_ids_in_test)].sample(frac=0.2),
        df_test.loc[lambda x: x.ID.isin(car_ids_in_test)],
    )
).reset_index(drop=True)
Code
leaf_sim = LeafSim(model)
Code
# Top 10
top_n_ids, top_n_similarities = leaf_sim.generate_explanations(
    X_train=df_train[feature_cols].values,
    X_to_explain=df_test_subset[feature_cols].values,
    top_n=10,
)
Code
# Top 50 (for more advanced analysis)
_, top_50_similarities, similarities = leaf_sim.generate_explanations(
    X_train=df_train[feature_cols].values,
    X_to_explain=df_test_subset[feature_cols].values,
    top_n=50,
    return_all_similarities=True,
)

10 Examples

To understand how these LeafSim scores can be utilized, let us start by analysing two concrete examples.

Code
# Get the most important features according to SHAP as a list
# These features will be used when showing similar instances
# Measure importance in absolute terms
abs_mean_shap_values = np.abs(shap_values.values).mean(0)
df_feature_importance = pd.DataFrame(
    list(zip(feature_cols, abs_mean_shap_values)),
    columns=["feature_name", "feature_importance"],
).sort_values(by=["feature_importance"], ascending=False)
# Restrict ourselves to the top 8 most influential features
top_n_features = list(df_feature_importance.feature_name.values[:8])

# Define the formatting of some columns
formatting = {
    target_col: lambda x: f"£ {x:,.0f}",
    predicted_col: lambda x: f"£ {x:,.0f}",
    "similarity": lambda x: f"{x*100:.0f}%",
    "enginesize": lambda x: f"{x:.1f}",
}
Code
avg_top_sim = top_50_similarities.mean(axis=1)
df_test_subset["avg_top_sim"] = avg_top_sim
Code
# Wrap the utils function into another function to
# reduce the number of parameters we have to pass
_get_similarity_table = lambda car_id: get_similarity_table(
    df_train,
    df_test_subset,
    top_n_ids,
    top_n_similarities,
    car_id,
    top_n_features,
    formatting,
)

_get_similarity_plots = lambda car_id: get_similarity_plots(
    df_train, df_test_subset, similarities, car_id, avg_top_sim
)

10.1 Example 1

As a first example, we will pick the BMW that we previously used to explain SHAP, and as this is a common brand and model, we expect the model to make a reasonably accurate prediction.

Code
# Pick the same car from the SHAP example
car_to_explain_id = 20844  # value of the "ID" column (not a positional index)
example = df_test_subset.query(f"ID == {car_to_explain_id}")[
    top_n_features + [predicted_col, target_col]
]
car_to_explain_idx = example.index[0]
Code
# Will not be nicely rendered in Quarto
# See https://github.com/quarto-dev/quarto-cli/discussions/1716
# display(example.style.format(formatting))
# Hence don't format prices:
(
    example.assign(
        predicted_price=lambda x: x.predicted_price.apply("£{:,}".format),
        price=lambda x: x.price.apply("£{:,}".format),
    )
)
enginesize year model mileage brand transmission fueltype predicted_price price
3749 3.0 2016 M4 32866 bmw Semi-Auto Petrol £30,755 £30,485

The table above describes the attributes of our selected car alongside the model prediction and the car’s actual sale price.

We can see that the model provides an accurate prediction, but we would like to understand which cars in the training set this prediction is mostly based on.

For simplicity, we restrict ourselves to the Top 10 cars with the highest LeafSim score, i.e., the ten cars that most often appear in the same leaf as the car we want to explain. A sensible cut-off depends on how fast LeafSim scores drop in the ranking: if the scores barely change, showing a large subset of the most similar observations only adds redundant information.
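
One quick way to gauge this drop-off is to plot the sorted scores for the car in question, as in the following sketch (assuming top_50_similarities holds one row of similarity scores per test observation):

Code
# Hypothetical check of how quickly LeafSim scores decay in the ranking
plt.plot(np.sort(top_50_similarities[car_to_explain_idx])[::-1], marker=".")
plt.xlabel("Rank")
plt.ylabel("LeafSim score")
plt.show()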

The Top 10 cars are described in the table below.

Code
display(_get_similarity_table(car_to_explain_idx))
  enginesize year model mileage brand transmission fueltype similarity price
0 3.0 2016 M4 33113 bmw Automatic Petrol 88% £ 32,983
1 3.0 2017 M4 32329 bmw Semi-Auto Petrol 88% £ 34,890
2 3.0 2016 M4 23765 bmw Semi-Auto Petrol 86% £ 34,932
3 3.0 2016 M4 38422 bmw Semi-Auto Petrol 86% £ 30,490
4 3.0 2016 M4 23212 bmw Semi-Auto Petrol 86% £ 32,998
5 3.0 2016 M4 21107 bmw Semi-Auto Petrol 86% £ 31,498
6 3.0 2016 M4 21533 bmw Semi-Auto Petrol 86% £ 31,870
7 3.0 2016 M4 27523 bmw Semi-Auto Petrol 82% £ 30,990
8 3.0 2016 M4 30241 bmw Semi-Auto Petrol 82% £ 29,990
9 3.0 2016 M4 34209 bmw Semi-Auto Petrol 80% £ 29,998

The colours in the table represent how similar the cars are to the car we want to explain.
The lighter the shade of red of a cell, the more similar the two cars are in that respect,
while the darker the shade of blue in the similarity column, the more similar the two cars overall.

We can see that for the most important features, the provided cars are indeed very similar.
Their sale prices are also very much in line with the selected car’s predicted sale price.

In the above table, we have restricted the data to the Top 10.
However, we could also investigate the relationship between price and the LeafSim score across the entire training set, as is shown in the following plot on the left.

Code
_get_similarity_plots(car_to_explain_idx)
plt.savefig("images/example_1.png", dpi=600)

The prices of cars with a higher LeafSim score vary less and are generally closer to the predicted price of the car we want to explain.

To put the LeafSim score of this particular car in perspective, the plot on the right shows the distribution of the average LeafSim score for the Top 50 training instances.
We can see that this car has several similar training instances (the mean LeafSim score of the Top 50 is roughly 80%).
This indicates the model is basing its prediction on very relevant training samples and therefore the prediction can be assumed to be fairly accurate.

The choice of 50 here is somewhat arbitrary and data-dependent.
It should roughly equal the number of observations we can typically expect to be highly similar,
so that predictions with few similar training instances stand out.

10.2 Example 2

As a second example, let us pick an electric car. As these are drastically under-represented in the dataset, we expect the model to perform poorly on them.

Code
# Test set error rate by fueltype
(
    df_test.groupby(["fueltype"])
    .agg(mean_absolute_percentage_error=("absolute_percentage_error", "mean"))
    .sort_values(by="mean_absolute_percentage_error")
    .rename(
        columns={"mean_absolute_percentage_error": "Mean Absolute Percentage Error"}
    )
    .applymap(lambda x: "{:.1f}%".format(x * 100))
)
Mean Absolute Percentage Error
fueltype
Hybrid 7.8%
Petrol 9.1%
Diesel 10.1%
Other 10.4%
Electric 31.7%

Indeed, the error rate of EVs is roughly 3x larger than for any other fuel type.

Let us pick a specific electric car:

Code
# Pick an electric car
car_to_explain_id = 13284  # value of the "ID" column (not a positional index)
example = df_test_subset.query(f"ID == {car_to_explain_id}")[
    top_n_features + [predicted_col, target_col]
]
car_to_explain_idx = example.index[0]
Code
# display(example.style.format(formatting))
(
    example.assign(
        predicted_price=lambda x: x.predicted_price.apply("£{:,}".format),
        price=lambda x: x.price.apply("£{:,}".format),
    )
)
enginesize year model mileage brand transmission fueltype predicted_price price
3750 1.4 2015 Ampera 34461 vauxhall Automatic Electric £17,123 £12,999

We already see that the predicted price is quite different from the actual price.

In a real-world setting, however, we would not know the actual price: the tool is meant to help car dealers set the selling price, so the car would not have been sold yet.

To better understand what this prediction is based on and whether one can fully trust it, we can again turn to the most similar training examples as identified by LeafSim.

Code
display(_get_similarity_table(car_to_explain_idx))
  enginesize year model mileage brand transmission fueltype similarity price
0 1.8 2015 Prius 19350 toyota Automatic Hybrid 62% £ 17,495
1 1.8 2015 Prius 25360 toyota Automatic Hybrid 60% £ 23,995
2 1.5 2015 i8 43323 bmw Automatic Hybrid 60% £ 44,990
3 1.5 2015 i8 43102 bmw Automatic Hybrid 60% £ 42,890
4 1.8 2015 Prius 48911 toyota Automatic Hybrid 60% £ 18,795
5 1.8 2015 Prius 42929 toyota Automatic Other 60% £ 12,000
6 1.8 2015 Prius 28001 toyota Automatic Other 58% £ 21,795
7 1.8 2015 Prius 30862 toyota Automatic Other 58% £ 21,990
8 1.8 2015 Prius 32314 toyota Automatic Hybrid 56% £ 15,499
9 1.0 2015 i3 29465 bmw Automatic Electric 54% £ 17,400

In this case, we can clearly see that the most similar training instances are not very close to the car whose price we want to predict.

The LeafSim scores are low (the Top 50 average falls below 50%) and crucial attributes vary considerably, such as brand, model, and fuel type.

Code
_get_similarity_plots(car_to_explain_idx)
plt.savefig("images/example_2.png", dpi=600)

We also see from the left-hand plot above that there is a lot of price variation, even among the most similar cars.
The right-hand plot clearly shows that the average LeafSim scores of the most related training instances are very low.

Therefore, car dealers should lower their confidence in the model prediction for this particular car.

11 Conclusion

In this blog post, we addressed the interpretability gap of popular XAI techniques, such as SHAP and LIME, regarding their use by people with limited knowledge of Machine Learning.

Specifically, an example-based approach to explaining decision-tree-based ensemble methods was proposed, identifying the most relevant instances in the training set on which the model bases a particular prediction. This measure of relevance is expressed using the LeafSim score, which captures how often two observations end up in the same leaf.

Mainly through a qualitative assessment of selected examples, we demonstrated that the LeafSim score indeed reflects similarity between observations.
Furthermore, the model tends to predict observations with higher LeafSim scores more accurately.
Therefore, end-users can use such scores to adjust their confidence in model predictions.

12 About the Author

Lucas Chizzali is a Data Scientist at the Swiss Data Science Center (SDSC).

Lucas joined the SDSC's industry cell as a Data Scientist in November 2020, having previously worked in data-related roles at the New York State Attorney General's Office and at Ericsson. He holds a BSc in Economics from Bocconi University, an MSc in Urban Science and Informatics from New York University, as well as an MSc in Machine Learning from KTH Royal Institute of Technology. Over the course of his academic and professional career, he has worked on a variety of topics, from computer vision tasks for automated driving to financial fraud detection to generating data-driven insights to inform urban policy decisions.

LeafSim was developed while working at Richemont.

13 About the SDSC

The Swiss Data Science Center (SDSC) is a joint venture between EPFL and ETH Zurich. Its mission is to accelerate the adoption of data science and machine learning techniques within academic disciplines of the ETH domain, the Swiss academic community at large, and the industrial sector. In particular, it addresses the gap between those who create data, those who develop data analytics and systems, and those who could potentially extract value from it. The center comprises a multi-disciplinary team of data and computer scientists as well as experts in selected domains with offices in Zürich, Lausanne, and Villigen.
www.datascience.ch

14 Acknowledgements

The author would like to thank colleagues at Richemont and SDSC, namely Elvire Bouvier, Francesco Calabrese and Valerio Rossetti for their support in developing LeafSim and for their valuable feedback in writing this blog post.

15 References

[1] Sharchilev, Boris, Yury Ustinovskiy, Pavel Serdyukov, and Maarten de Rijke. "Finding Influential Training Samples for Gradient Boosted Decision Trees." In International Conference on Machine Learning, pp. 4577-4585. PMLR, 2018.

16 Appendix

16.1 Technical Considerations

Each tree has the same weight in LeafSim, which may not be the most faithful approach depending on the Machine Learning method we would like to explain.
Empirical results for gradient boosting show that giving equal weights to all trees provides useful and insightful results.
One way to verify this is to analyse the correlation between error and the LeafSim score.
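
As a sketch, such a check could reuse columns created earlier in this notebook; a negative rank correlation would support the claim (the higher the average similarity, the lower the error):

Code
# Rank correlation between per-car error and average Top-50 LeafSim score
corr = df_test_subset["avg_top_sim"].corr(
    df_test_subset["absolute_percentage_error"], method="spearman"
)
print(f"Spearman correlation: {corr:.2f}")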