| Title: | Multivariate Outlier Explanations using Shapley Values and Mahalanobis Distances |
|---|---|
| Description: | Based on Shapley values to explain multivariate outlyingness and to detect and impute cellwise outliers. Includes implementations of methods described in Mayrhofer and Filzmoser (2023) <doi:10.1016/j.ecosta.2023.04.003>. |
| Authors: | Marcus Mayrhofer [aut, cre], Peter Filzmoser [aut] |
| Maintainer: | Marcus Mayrhofer <[email protected]> |
| License: | GPL-3 |
| Version: | 0.1.2 |
| Built: | 2026-05-23 07:21:18 UTC |
| Source: | https://github.com/cran/ShapleyOutlier |
The MOE function indicates outlying cells for
a data vector with entries or data matrix with entries containing only numeric entries x
for a given center mu and covariance matrix Sigma using the Shapley value.
It is a more sophisticated alternative to the SCD algorithm,
which uses the information of the regular cells to derive an alternative reference point (Mayrhofer and Filzmoser 2023).
MOE( x, mu, Sigma, Sigma_inv = NULL, step_size = 0.1, min_deviation = 0, max_step = NULL, local = TRUE, max_iter = 1000, q = 0.99, check_outlyingness = FALSE, check = TRUE, cells = NULL, method = "cellMCD" )MOE( x, mu, Sigma, Sigma_inv = NULL, step_size = 0.1, min_deviation = 0, max_step = NULL, local = TRUE, max_iter = 1000, q = 0.99, check_outlyingness = FALSE, check = TRUE, cells = NULL, method = "cellMCD" )
x |
Data vector with |
mu |
Either |
Sigma |
Either |
Sigma_inv |
Either |
step_size |
Numeric. Step size for the imputation of outlying cells, with |
min_deviation |
Numeric. Detection threshold, with |
max_step |
Either |
local |
Logical. If TRUE (default), the non-central Chi-Squared distribution is used to determine the cutoff value based on |
max_iter |
Integer. The maximum number of iterations. |
q |
Numeric. The quantile of the Chi-squared distribution for detection and imputation of outliers. Defaults to |
check_outlyingness |
Logical. If TRUE (default), the outlyingness is rechecked after applying |
check |
Logical. If |
cells |
Either |
method |
Either "cellMCD" (default) or "MCD". Specifies the method used for parameter estimation if |
A list of class shapley_algorithm (new_shapley_algorithm) containing the following:
x |
A |
phi |
A |
mu_tilde |
A |
x_original |
A |
x_original |
The non-centrality parameters for the Chi-Squared distribution |
x_history |
A list with |
phi_history |
A list with |
mu_tilde_history |
A list with |
S_history |
A list with |
Mayrhofer M, Filzmoser P (2023). “Multivariate outlier explanations using Shapley values and Mahalanobis distances.” Econometrics and Statistics.
p <- 5 mu <- rep(0,p) Sigma <- matrix(0.9, p, p); diag(Sigma) = 1 Sigma_inv <- solve(Sigma) x <- c(0,1,2,2.3,2.5) MOE_x <- MOE(x = x, mu = mu, Sigma = Sigma) plot(MOE_x) library(MASS) set.seed(1) n <- 100; p <- 10 mu <- rep(0,p) Sigma <- matrix(0.9, p, p); diag(Sigma) = 1 X <- mvrnorm(n, mu, Sigma) X[sample(1:(n*p), 100, FALSE)] <- rep(c(-5,5),50) MOE_X <- MOE(X, mu, Sigma) plot(MOE_X, subset = 20)p <- 5 mu <- rep(0,p) Sigma <- matrix(0.9, p, p); diag(Sigma) = 1 Sigma_inv <- solve(Sigma) x <- c(0,1,2,2.3,2.5) MOE_x <- MOE(x = x, mu = mu, Sigma = Sigma) plot(MOE_x) library(MASS) set.seed(1) n <- 100; p <- 10 mu <- rep(0,p) Sigma <- matrix(0.9, p, p); diag(Sigma) = 1 X <- mvrnorm(n, mu, Sigma) X[sample(1:(n*p), 100, FALSE)] <- rep(c(-5,5),50) MOE_X <- MOE(X, mu, Sigma) plot(MOE_X, subset = 20)
shapley.This function creates an object of class shapley that is returned by the shapley function.
new_shapley(phi = numeric(), mu_tilde = NULL, non_centrality = NULL)new_shapley(phi = numeric(), mu_tilde = NULL, non_centrality = NULL)
phi |
A |
mu_tilde |
Optional. A |
non_centrality |
Optional. The non-centrality parameters for the Chi-Squared distribution,
which are given by |
Named list of class shapley, containing the input parameters.
shapley_algorithm.This function creates an object of class shapley_algorithm that is returned
by the SCD and MOE functions.
new_shapley_algorithm( x = numeric(), phi = numeric(), x_original = numeric(), mu_tilde = NULL, non_centrality = NULL, x_history = NULL, phi_history = NULL, mu_tilde_history = NULL, S_history = NULL )new_shapley_algorithm( x = numeric(), phi = numeric(), x_original = numeric(), mu_tilde = NULL, non_centrality = NULL, x_history = NULL, phi_history = NULL, mu_tilde_history = NULL, S_history = NULL )
x |
A |
phi |
A |
x_original |
A |
mu_tilde |
Optional. A |
non_centrality |
Optional. The non-centrality parameters for the Chi-Squared distribution,
which are given by |
x_history |
Optional. A list with |
phi_history |
Optional. A list with |
mu_tilde_history |
Optional. A list with |
S_history |
Optional. A list with |
Named list of class shapley_algorithm, containing the input parameters.
shapley_interaction.This function creates an object of class shapley_interaction that is returned
by the shapley_interaction function.
new_shapley_interaction(PHI = numeric())new_shapley_interaction(PHI = numeric())
PHI |
A |
Matrix of class shapley_interaction, containing input matrix PHI.
Barplot of Shapley values
## S3 method for class 'shapley' plot( x, subset = NULL, chi2.q = 0.99, abbrev.var = 3, abbrev.obs = 10, sort.var = FALSE, sort.obs = FALSE, plot_md = TRUE, md_squared = TRUE, rotate_x = TRUE, ... )## S3 method for class 'shapley' plot( x, subset = NULL, chi2.q = 0.99, abbrev.var = 3, abbrev.obs = 10, sort.var = FALSE, sort.obs = FALSE, plot_md = TRUE, md_squared = TRUE, rotate_x = TRUE, ... )
x |
A list of class |
subset |
Either an integer, |
chi2.q |
Quantile, only used if |
abbrev.var |
Integer. If |
abbrev.obs |
Integer. If |
sort.var |
Logical. If |
sort.obs |
Logical. If |
plot_md |
Logical. If |
md_squared |
Logical. If |
rotate_x |
Logical. If |
... |
Optional arguments passed to methods. |
Returns a barplot that displays the Shapley values (shapley)for each observation and optionally (plot_md = TRUE)
includes the squared Mahalanobis distance (black bar) and the corresponding (non-)central chi-square quantile (dotted line).
library(MASS) set.seed(1) n <- 100; p <- 10 mu <- rep(0,p) Sigma <- matrix(0.9, p, p); diag(Sigma) = 1 X <- mvrnorm(n, mu, Sigma) X_clean <- X X[sample(1:(n*p), 100, FALSE)] <- rep(c(-5,5),50) call_shapley <- shapley(X, mu, Sigma) plot(call_shapley, subset = 1:20) plot(call_shapley, subset = 5, rotate_x = FALSE) plot(call_shapley, subset = 5, md_squared = FALSE, rotate_x = FALSE)library(MASS) set.seed(1) n <- 100; p <- 10 mu <- rep(0,p) Sigma <- matrix(0.9, p, p); diag(Sigma) = 1 X <- mvrnorm(n, mu, Sigma) X_clean <- X X[sample(1:(n*p), 100, FALSE)] <- rep(c(-5,5),50) call_shapley <- shapley(X, mu, Sigma) plot(call_shapley, subset = 1:20) plot(call_shapley, subset = 5, rotate_x = FALSE) plot(call_shapley, subset = 5, md_squared = FALSE, rotate_x = FALSE)
Barplot and tileplot of Shapley values.
## S3 method for class 'shapley_algorithm' plot( x, type = "both", subset = NULL, abbrev.var = FALSE, abbrev.obs = FALSE, sort.var = FALSE, sort.obs = FALSE, n_digits = 2, rotate_x = TRUE, continuous_rowname = FALSE, ... )## S3 method for class 'shapley_algorithm' plot( x, type = "both", subset = NULL, abbrev.var = FALSE, abbrev.obs = FALSE, sort.var = FALSE, sort.obs = FALSE, n_digits = 2, rotate_x = TRUE, continuous_rowname = FALSE, ... )
x |
A list of class |
type |
Either |
subset |
Either an integer, |
abbrev.var |
Integer. If |
abbrev.obs |
Integer. If |
sort.var |
Logical. If |
sort.obs |
Logical. If |
n_digits |
Integer. If |
rotate_x |
Logical. If |
continuous_rowname |
Logical. If |
... |
Arguments passed on to |
Returns plots for a list of class shapley_algorithm.
If type is "bar", a barplot is generated. It displays the Shapley values (shapley)
for each observation and optionally (plot_md = TRUE) includes the squared Mahalanobis distance (black bar)
and the corresponding (non-)central chi-square quantile (dotted line).
If type is "cell" a tileplot is generated. It displays each cells of the dataset and shows the original value from the observations,
color coding indicates whether those values were higher (red) or lower (blue) than the imputed values,
and the color intensity is based on the magnitude of the Shapley value.
If type is "both", the barplot and the tileplot are generated.
library(MASS) set.seed(1) n <- 100; p <- 10 mu <- rep(0,p) Sigma <- matrix(0.9, p, p); diag(Sigma) = 1 X <- mvrnorm(n, mu, Sigma) X[sample(1:(n*p), 100, FALSE)] <- rep(c(-5,5),50) MOE_X <- MOE(X, mu, Sigma) plot(MOE_X, subset = 20, n_digits = 0)library(MASS) set.seed(1) n <- 100; p <- 10 mu <- rep(0,p) Sigma <- matrix(0.9, p, p); diag(Sigma) = 1 X <- mvrnorm(n, mu, Sigma) X[sample(1:(n*p), 100, FALSE)] <- rep(c(-5,5),50) MOE_X <- MOE(X, mu, Sigma) plot(MOE_X, subset = 20, n_digits = 0)
Plot of Shapley interaction indices
## S3 method for class 'shapley_interaction' plot( x, abbrev = 4, title = "Shapley Interaction", legend = TRUE, text_size = 22, ... )## S3 method for class 'shapley_interaction' plot( x, abbrev = 4, title = "Shapley Interaction", legend = TRUE, text_size = 22, ... )
x |
A |
abbrev |
Integer. If |
title |
Character. Title of the plot. |
legend |
Logical. If TRUE (default), a legend is plotted. |
text_size |
Integer. Size of the text in the plot |
... |
Optional arguments passed to methods. |
Returns a figure consisting of two panels. The upper panel shows the Shapley values, and the lower panel the Shapley interaction indices.
p <- 5 mu <- rep(0,p) Sigma <- matrix(0.9, p, p); diag(Sigma) = 1 Sigma_inv <- solve(Sigma) x <- c(0,1,2,2.3,2.5) PHI <- shapley_interaction(x, mu, Sigma) plot(PHI)p <- 5 mu <- rep(0,p) Sigma <- matrix(0.9, p, p); diag(Sigma) = 1 Sigma_inv <- solve(Sigma) x <- c(0,1,2,2.3,2.5) PHI <- shapley_interaction(x, mu, Sigma) plot(PHI)
shapley.Print function for class shapley.
## S3 method for class 'shapley' print(x, ...)## S3 method for class 'shapley' print(x, ...)
x |
List of class |
... |
Optional arguments passed to methods. |
Prints the list entries of x that are not NULL.
shapley_algorithm.Print function for class shapley_algorithm.
## S3 method for class 'shapley_algorithm' print(x, ...)## S3 method for class 'shapley_algorithm' print(x, ...)
x |
List of class |
... |
Optional arguments passed to methods. |
Prints the imputed data and the Shapley values.
shapley_interaction.Print function for class shapley_interaction.
## S3 method for class 'shapley_interaction' print(x, ...)## S3 method for class 'shapley_interaction' print(x, ...)
x |
Matrix of class |
... |
Optional arguments passed to methods. |
Prints the Shapley interaction indices.
The SCD function indicates outlying cells for
a data vector with entries or data matrix with entries containing only numeric entries x
for a given center mu and covariance matrix Sigma using the Shapley value (Mayrhofer and Filzmoser 2023).
SCD( x, mu, Sigma, Sigma_inv = NULL, step_size = 0.1, min_deviation = 0, max_step = NULL, max_iter = 1000, q = 0.99, method = "cellMCD", check = TRUE, cells = NULL )SCD( x, mu, Sigma, Sigma_inv = NULL, step_size = 0.1, min_deviation = 0, max_step = NULL, max_iter = 1000, q = 0.99, method = "cellMCD", check = TRUE, cells = NULL )
x |
Data vector with |
mu |
Either |
Sigma |
Either |
Sigma_inv |
Either |
step_size |
Numeric. Step size for the imputation of outlying cells, with |
min_deviation |
Numeric. Detection threshold, with |
max_step |
Either |
max_iter |
Integer. The maximum number of iterations. |
q |
Numeric. The quantile of the Chi-squared distribution for detection and imputation of outliers. Defaults to |
method |
Either "cellMCD" (default) or "MCD". Specifies the method used for parameter estimation if |
check |
Logical. If |
cells |
Either |
A list of class shapley_algorithm (new_shapley_algorithm) containing the following:
x |
A |
phi |
A |
x_original |
A |
x_history |
The path of how the original data vector was modified. |
phi_history |
The Shapley values corresponding to |
S_history |
The indices of the outlying cells in each iteration. |
Mayrhofer M, Filzmoser P (2023). “Multivariate outlier explanations using Shapley values and Mahalanobis distances.” Econometrics and Statistics.
p <- 5 mu <- rep(0,p) Sigma <- matrix(0.9, p, p); diag(Sigma) = 1 Sigma_inv <- solve(Sigma) x <- c(0,1,2,2.3,2.5) SCD_x <- SCD(x = x, mu = mu, Sigma = Sigma) plot(SCD_x) library(MASS) set.seed(1) n <- 100; p <- 10 mu <- rep(0,p) Sigma <- matrix(0.9, p, p); diag(Sigma) = 1 X <- mvrnorm(n, mu, Sigma) X[sample(1:(n*p), 100, FALSE)] <- rep(c(-5,5),50) SCD_X <- SCD(X, mu, Sigma) plot(SCD_X, subset = 20)p <- 5 mu <- rep(0,p) Sigma <- matrix(0.9, p, p); diag(Sigma) = 1 Sigma_inv <- solve(Sigma) x <- c(0,1,2,2.3,2.5) SCD_x <- SCD(x = x, mu = mu, Sigma = Sigma) plot(SCD_x) library(MASS) set.seed(1) n <- 100; p <- 10 mu <- rep(0,p) Sigma <- matrix(0.9, p, p); diag(Sigma) = 1 X <- mvrnorm(n, mu, Sigma) X[sample(1:(n*p), 100, FALSE)] <- rep(c(-5,5),50) SCD_X <- SCD(X, mu, Sigma) plot(SCD_X, subset = 20)
The shapley function computes a -dimensional vector containing the decomposition of the
squared Mahalanobis distance of x (with respect to mu and Sigma)
into outlyingness contributions of the individual variables (Mayrhofer and Filzmoser 2023).
The value of the -th coordinate of this vector represents the
average marginal contribution of the -th variable to the squared Mahalanobis distance of
the individual observation x.
If cells is provided, Shapley values of x are computed with respect to a local reference point,
that is based on a cellwise prediction of each coordinate, using the information of the regular cells of x, see (Mayrhofer and Filzmoser 2023).
If x is a matrix, a matrix is returned, containing the decomposition for each row.
shapley( x, mu = NULL, Sigma = NULL, inverted = FALSE, method = "cellMCD", check = TRUE, cells = NULL )shapley( x, mu = NULL, Sigma = NULL, inverted = FALSE, method = "cellMCD", check = TRUE, cells = NULL )
x |
Data vector with |
mu |
Either |
Sigma |
Either |
inverted |
Logical. If |
method |
Either "cellMCD" (default) or "MCD". Specifies the method used for parameter estimation if |
check |
Logical. If |
cells |
Either |
phi |
A |
mu_tilde |
A |
non_centrality |
The non-centrality parameters for the Chi-Squared distribution, given by |
Mayrhofer M, Filzmoser P (2023). “Multivariate outlier explanations using Shapley values and Mahalanobis distances.” Econometrics and Statistics.
## Without outlying cells as input in the 'cells' argument#' # Single observation p <- 5 mu <- rep(0,p) Sigma <- matrix(0.9, p, p); diag(Sigma) = 1 Sigma_inv <- solve(Sigma) x <- c(0,1,2,2.3,2.5) shapley(x, mu, Sigma) phi <- shapley(x, mu, Sigma_inv, inverted = TRUE) plot(phi) # Multiple observations library(MASS) set.seed(1) n <- 100; p <- 10 mu <- rep(0,p) Sigma <- matrix(0.9, p, p); diag(Sigma) = 1 X <- mvrnorm(n, mu, Sigma) X_clean <- X X[sample(1:(n*p), 100, FALSE)] <- rep(c(-5,5),50) call_shapley <- shapley(X, mu, Sigma) plot(call_shapley, subset = 20) ## Giving outlying cells as input in the 'cells' argument # Single observation p <- 5 mu <- rep(0,p) Sigma <- matrix(0.9, p, p); diag(Sigma) = 1 Sigma_inv <- solve(Sigma) x <- c(0,1,2,2.3,2.5) call_shapley <- shapley(x, mu, Sigma_inv, inverted = TRUE, method = "cellMCD", check = TRUE, cells = c(1,1,0,0,0)) plot(call_shapley) # Multiple observations library(MASS) set.seed(1) n <- 100; p <- 10 mu <- rep(0,p) Sigma <- matrix(0.9, p, p); diag(Sigma) = 1 X <- mvrnorm(n, mu, Sigma) X_clean <- X X[sample(1:(n*p), 100, FALSE)] <- rep(c(-5,5),50) call_shapley <- shapley(X, mu, Sigma, cells = (X_clean - X)!=0) plot(call_shapley, subset = 20)## Without outlying cells as input in the 'cells' argument#' # Single observation p <- 5 mu <- rep(0,p) Sigma <- matrix(0.9, p, p); diag(Sigma) = 1 Sigma_inv <- solve(Sigma) x <- c(0,1,2,2.3,2.5) shapley(x, mu, Sigma) phi <- shapley(x, mu, Sigma_inv, inverted = TRUE) plot(phi) # Multiple observations library(MASS) set.seed(1) n <- 100; p <- 10 mu <- rep(0,p) Sigma <- matrix(0.9, p, p); diag(Sigma) = 1 X <- mvrnorm(n, mu, Sigma) X_clean <- X X[sample(1:(n*p), 100, FALSE)] <- rep(c(-5,5),50) call_shapley <- shapley(X, mu, Sigma) plot(call_shapley, subset = 20) ## Giving outlying cells as input in the 'cells' argument # Single observation p <- 5 mu <- rep(0,p) Sigma <- matrix(0.9, p, p); diag(Sigma) = 1 Sigma_inv <- solve(Sigma) x <- c(0,1,2,2.3,2.5) call_shapley <- shapley(x, mu, Sigma_inv, inverted = TRUE, method = "cellMCD", check = TRUE, cells = c(1,1,0,0,0)) plot(call_shapley) # Multiple observations library(MASS) set.seed(1) n <- 100; p <- 10 mu <- rep(0,p) Sigma <- matrix(0.9, p, p); diag(Sigma) = 1 X <- mvrnorm(n, mu, Sigma) X_clean <- X X[sample(1:(n*p), 100, FALSE)] <- rep(c(-5,5),50) call_shapley <- shapley(X, mu, Sigma, cells = (X_clean - X)!=0) plot(call_shapley, subset = 20)
The shapley_interaction function computes a matrix
containing pairwise outlyingness scores based on Shapley interaction indices.
It decomposes the squared Mahalanobis distance of x (with respect to mu and Sigma)
into outlyingness contributions of pairs of variables (Mayrhofer and Filzmoser 2023).
shapley_interaction(x, mu, Sigma, inverted = FALSE)shapley_interaction(x, mu, Sigma, inverted = FALSE)
x |
Data vector with |
mu |
Either |
Sigma |
Either |
inverted |
Logical. If |
A matrix containing the decomposition of the squared Mahalanobis distance of x
into outlyingness scores for pairs of variables with respect to mu and Sigma.
Mayrhofer M, Filzmoser P (2023). “Multivariate outlier explanations using Shapley values and Mahalanobis distances.” Econometrics and Statistics.
p <- 5 mu <- rep(0,p) Sigma <- matrix(0.9, p, p); diag(Sigma) = 1 Sigma_inv <- solve(Sigma) x <- c(0,1,2,2.3,2.5) shapley_interaction(x, mu, Sigma) PHI <- shapley_interaction(x, mu, Sigma_inv, inverted = TRUE) plot(PHI)p <- 5 mu <- rep(0,p) Sigma <- matrix(0.9, p, p); diag(Sigma) = 1 Sigma_inv <- solve(Sigma) x <- c(0,1,2,2.3,2.5) shapley_interaction(x, mu, Sigma) PHI <- shapley_interaction(x, mu, Sigma_inv, inverted = TRUE) plot(PHI)
Monthly data from the weather station Hohe Warte since April 1872 - Vienna (Stadt Wien 2022).
WeatherViennaWeatherVienna
A data frame with 1,804 rows and 25 columns:
yearYear
monthMonth
tDaily mean air temperature in °C, (t7 mean + t19 mean + tmax mean + tmin mean)/4; before 1971: t7 mean + t14 mean + 2 x t21 mean)
t_maxAbsolute maximum air temperature in °C
t_minAbsolute air temperature minimum in °C
avg_t_maxMean daily maximum air temperature in °C
avg_t_minMean daily minimum air temperature in °C
num_frostNumber of frost days (days with a temperature maximum tmin < 0.0 °C)
num_iceNumber of ice days (days with a temperature maximum tmax < 0.0 °C)
num_summerNumber of summer days (days with a temperature maximum tmax >= 25.0 °C)
num_heatNumber of hot days (days with a temperature maximum tmax >= 30.0 °C)
pDaily mean air pressure in hPa (mean of all measurements at 7 a.m., 2 p.m., 7 p.m. CET; before 1971 9 p.m. instead of 7 p.m.)
p_maxMaximum air pressure in hPa (maximum of all measurements7 am, 2 pm, 7 pm CET; before 1971 9 pm instead of 7 pm)
p_minMinimum air pressure in hPa (minimum of all measurements7 am, 2 pm, 7 pm CET; before 1971 9 pm instead of 7 pm)
sun_hMonthly total sunshine duration in hours
num_clearNumber of clear days (daily mean cloudiness < 20/100)
num_cloudNumber of cloudy days (daily mean cloudiness > 80/100)
rel_humDaily mean relative humidity in percent (2 x RH7 mean + RH14 mean + RH19 mean)/4; before 1971 9 p.m. instead of 7 p.m.)
rel_hum_maxRelative humidity maximum in percent
rel_hum_minRelative humidity minimum in percent
wind_vMonthly average wind speed in km/h
num_wind_v60Number of days with wind peaks >= 60 km/h
wind_v_maxMaximum wind speed in km/h
precp_sumMonthly total precipitation in mm
num_precp_01Number of days with precipitation >= 0.1 mm
The data were downloaded from https://www.data.gv.at/katalog/dataset/wetter-seit-1872-hohe-warte-wien in September 2022.
Stadt Wien (2022). “Monthly data from the weather station Hohe Warte since April 1872 - Vienna.” https://www.data.gv.at/katalog/dataset/wetter-seit-1872-hohe-warte-wien.
data("WeatherVienna") summary(WeatherVienna)data("WeatherVienna") summary(WeatherVienna)