Clover plot

This page provides a quick overview of the clover plot, as proposed in the paper “Clover plot: Versatile visualisation in nonparametric classification”, by Dvořák, J., Hudecová, Š., and Nagy, S., submitted to the Statistical Analysis and Data Mining journal.

The current R implementation of the functions discussed below can be downloaded here. Some basic examples are also given here.

A more detailed discussion of the possible settings can be accessed using the appropriate links below.

Two-dimensional dataset

First, the clover plot is constructed for training data generated from two bivariate normal distributions with different means and the same covariance matrix:

library(mvtnorm)

n1 <- 200
n2 <- 150

set.seed(2020)

X <- rmvnorm(n1, mean=c(0,0), sigma=diag(2)) # Observations of class 1
Y <- rmvnorm(n2, mean=c(2,2), sigma=diag(2)) # Observations of class 2

All the necessary quantities required later for visualisation and classification are computed (with default settings):

res <- clover_calc(X, Y)

Visualisation

For the two-dimensional dataset, direct visualisation is simple:

clover_plot_data(res)

Black points: observations of class 1 (sample X);
blue points: observations of class 2 (sample Y).

Triangles: observations with zero halfspace depth w.r.t. both samples.

Squares: observations marked as outliers by the bagplot procedure.

Grey/blue polygons: sets containing 50 % of observed points in each sample, used for the bagplot and illumination procedures.

Hovering tooltip: information about the given observation: its class label (1 for the X sample and 2 for the Y sample), its index within the corresponding sample and the list of the coordinates of the observation in the four quadrants of the clover plot.

Clover plot for the training data

clover_plot(res)

## Misclassification rate of the QDA classifier:   0.0829
## Non-classification rate of the QDA classifier:  0
## 
## Misclassification rate of the DD1 classifier:   0.0686
## Non-classification rate of the DD1 classifier:  0.0486

Black points: observations of class 1 (sample X); blue points: observations of class 2 (sample Y).

Triangles: observations with zero halfspace depth w.r.t. both samples.

Squares: observations marked as outliers by the bagplot procedure.

Purple curve in quadrant IV: boundary of the region where observed points can occur in the plot of Mahalanobis distances.

Purple curve in quadrant II: the same, plotted over the plot of bagdistances as an approximation.

Red dashed curve in quadrant I: separating curve of the linear DD classification rule (DD1).

Red dashed curve in quadrant IV: separating curve of the QDA classification rule.

Hovering tooltip: information about the given observation: its class label (1 for the X sample and 2 for the Y sample), the number of ties (see an explanation here), its index within the corresponding sample and the list of the coordinates of the observation in the four quadrants of the clover plot.

Text output: for the given classifiers (QDA and DD1 by default), two quantities are reported for the training dataset: the misclassification rate (the fraction of observations with incorrectly assigned class labels) and the non-classification rate (the fraction of observations to which the given classifier cannot assign a class label; this applies almost exclusively to the DD-type classifiers).
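
The QDA rule shown in quadrant IV depends on an observation only through its two Mahalanobis distances, which is why it can be drawn as a curve in that quadrant. The following is a minimal sketch in base R, not the code used by “clover_calc”: the data are simulated with rnorm, and the plug-in estimates (sample means, sample covariance matrices, class proportions as priors) are assumptions made for illustration.

```r
set.seed(1)

# Simulated training samples (base R only, independent standard normal coordinates)
X <- matrix(rnorm(200 * 2), ncol = 2)            # class 1, mean (0, 0)
Y <- matrix(rnorm(150 * 2, mean = 2), ncol = 2)  # class 2, mean (2, 2)

# Plug-in estimates of means, covariance matrices and prior probabilities
mX <- colMeans(X); SX <- cov(X)
mY <- colMeans(Y); SY <- cov(Y)
pX <- nrow(X) / (nrow(X) + nrow(Y))
pY <- 1 - pX

# Squared Mahalanobis distances of all training points w.r.t. both samples
W   <- rbind(X, Y)
dX2 <- mahalanobis(W, center = mX, cov = SX)
dY2 <- mahalanobis(W, center = mY, cov = SY)

# QDA assigns class 1 exactly when dX2 - dY2 falls below this threshold
threshold <- 2 * log(pX / pY) + log(det(SY) / det(SX))
pred <- ifelse(dX2 - dY2 < threshold, 1L, 2L)

truth <- rep(1:2, times = c(nrow(X), nrow(Y)))
mean(pred != truth)  # training misclassification rate of this sketch
```

In the plane of the two distances the boundary is the curve dX2 - dY2 = threshold, so it is a straight line only after squaring both axes.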

Real four-dimensional dataset

Next, the four-dimensional ‘biomed’ dataset (see details) is inspected. A direct visualisation is not straightforward in four dimensions, but the clover plot can still provide useful information.

library(ddalpha)

data(biomed)

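cls <- biomed$C # Class labels; named cls to avoid masking base::class
X.b <- as.matrix(biomed[cls==1,1:4]) # Observations of class 1
Y.b <- as.matrix(biomed[cls==2,1:4]) # Observations of class 2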

res.b <- clover_calc(X.b,Y.b,alphaX=0.08,alphaY=0.08)

## Computing values to be plotted.
## Status:0
## Status:0
## Computing boundary curves.
## Computing classifications.
## Computing number of ties.
## Done.
## 
## Central regions used for computing the illuminations are based on the following quantities:
## alphaX =  0.08
## alphaY =  0.08
## Fraction of points in sample X with halfspace depth >= alphaX:  0.239
## Fraction of points in sample Y with halfspace depth >= alphaY:  0.236
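
The reported fractions can be read as the share of sample points whose halfspace depth with respect to their own sample is at least alpha. The sketch below illustrates that reading on simulated data, using a simple random-direction approximation of the halfspace depth written in base R. The helper approx_halfspace_depth is hypothetical, and whether “clover_calc” uses an exact or an approximate depth is not stated on this page.

```r
set.seed(1)
X <- matrix(rnorm(200 * 2), ncol = 2)  # simulated sample, not the biomed data

# Random-direction approximation of the halfspace (Tukey) depth of the rows
# of `pts` w.r.t. the sample `data`. The result never underestimates the
# exact depth, because only finitely many directions are inspected.
approx_halfspace_depth <- function(pts, data, n_dir = 500) {
  U <- matrix(rnorm(n_dir * ncol(data)), nrow = ncol(data))  # random directions
  P_data <- data %*% U   # projections of the sample, one column per direction
  P_pts  <- pts %*% U    # projections of the evaluated points
  depth <- rep(1, nrow(pts))
  for (j in seq_len(n_dir)) {
    # Fractions of sample points in the two closed halfspaces through each point
    below <- colMeans(outer(P_data[, j], P_pts[, j], "<="))
    above <- colMeans(outer(P_data[, j], P_pts[, j], ">="))
    depth <- pmin(depth, below, above)
  }
  depth
}

alphaX  <- 0.08
depth_X <- approx_halfspace_depth(X, X)
mean(depth_X >= alphaX)  # fraction of sample points inside the central region
```

A larger alpha gives a smaller central region and therefore a smaller fraction.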

clover_plot(res.b)

## Misclassification rate of the QDA classifier:   0.1907
## Non-classification rate of the QDA classifier:  0
## 
## Misclassification rate of the DD1 classifier:   0.0722
## Non-classification rate of the DD1 classifier:  0.1856
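
Both rates are fractions of all observations in the training dataset. A minimal sketch of the computation on hypothetical labels follows; coding an unclassified observation as NA is an assumption made for this example and need not match the internal representation used by the clover functions.

```r
# Hypothetical true and predicted labels; NA marks an observation that the
# classifier could not assign to either class (an assumed coding)
truth <- c(1, 1, 1, 1, 2, 2, 2, 2, 2, 2)
pred  <- c(1, 1, 2, NA, 2, 2, 2, 1, NA, 2)

n <- length(truth)

# An observation is misclassified only if it received a label and the label is wrong
misclassified <- !is.na(pred) & pred != truth

misclass_rate <- sum(misclassified) / n  # 2 of 10 observations: 0.2
nonclass_rate <- sum(is.na(pred)) / n    # 2 of 10 observations: 0.2
```

With this convention the two rates and the fraction of correctly classified observations always sum to one.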

Arguments

Function “clover_calc”

This function computes the relevant quantities necessary for producing the clover plot. It returns a list of auxiliary entries to be used by the “clover_plot” function. The “clover_calc” function is expected to be run only once, performing all the possibly lengthy computations, allowing the “clover_plot” function to be called repeatedly with different options.

X: matrix, an n1 x d matrix of d-dimensional observations from the 1st class.

Y: matrix, an n2 x d matrix of d-dimensional observations from the 2nd class.

Z: optional, an nZ x d matrix of d-dimensional observations with unknown class labels, or a numeric vector of length d, to be also plotted in the clover plot. Examples are provided here.

alphaX: numeric, which upper level set of the halfspace depth should be used for the X sample for computing the illuminations? Examples are provided here.

alphaY: numeric, which upper level set of the halfspace depth should be used for the Y sample for computing the illuminations? Examples are provided here.

boundaryI: logical, should the boundary curve for the illuminations be computed? Examples are provided here.

Function “clover_plot”

This function produces the clover plot itself, based on the quantities computed by the “clover_calc” function. The “clover_plot” function is expected to be fast, enabling the user to repeatedly produce clover plots with different options.

clover: list, an output of the function “clover_calc”.

drawZ: logical, should new observations Z be plotted, if available? Examples are provided here.

interactive: logical, should an interactive plot be produced instead of a static one? Examples are provided here.

classifiers: character vector, which classifying boundaries should be plotted? Examples are provided here.

boundaries: logical, should boundary curves in quadrants II to IV be plotted? Examples are provided here.

jitter: logical, should a jitter be applied to quadrant I and to the point [0,0] in quadrant III? Examples are provided here.

uncertain: character, should points with uncertain classification w.r.t. the given classifier be highlighted? Examples are provided here.

uncertain.amount: numeric or NULL (default), how far away from the given classification boundary should the points be highlighted as uncertain? Examples are provided here.

edf.trans: logical, should the transformation based on empirical cumulative distribution function be applied in each marginal, in each quadrant separately? Examples are provided here.

samescale: logical, should the values plotted in quadrants II to IV be scaled using the same factor on both axes, or using a separate factor for each axis so that the whole quadrant is used? Examples are provided here.

offset: non-negative value, an offset of the plotted points from the coordinate axes. Examples are provided here.
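
The idea behind the edf.trans option can be illustrated in base R: each coordinate is replaced by the value of the empirical distribution function of its own margin, which spreads heavily skewed values evenly over the unit interval. This is only a sketch on hypothetical points; the helper edf_transform is not part of the clover functions, and the exact variant of the transformation used by “clover_plot” is not specified on this page.

```r
set.seed(1)

# Hypothetical coordinates of 50 points in one quadrant, with skewed margins
pts <- cbind(rexp(50), rt(50, df = 2)^2)

# Replace every value by the empirical distribution function of its own
# margin evaluated at that value; the results lie in (0, 1]
edf_transform <- function(M) apply(M, 2, function(v) ecdf(v)(v))

pts_edf <- edf_transform(pts)
range(pts_edf)
```

The transformation keeps the ordering of the points within each margin and changes only their spacing.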

Function “clover_plot_data”

This function can be used for interactive visualisation of two-dimensional datasets, providing the same hovering tooltip information as the clover plot itself (apart from the number of ties, which is relevant only in the clover plot). New observations with unknown class labels can be plotted, too.

X: matrix, an n1 x d matrix of d-dimensional observations from the 1st class.

Y: matrix, an n2 x d matrix of d-dimensional observations from the 2nd class.

clover: list, an output of the function “clover_calc”.

Z: optional, an nZ x d matrix of d-dimensional observations with unknown class labels, or a numeric vector of length d. Examples are provided here.

interactive: logical, should an interactive plot be produced instead of a static one? Examples are provided here.

central.regions: logical, should the central regions used for computing the illuminations be plotted? Examples are provided here.

Function “clover_classify”

This function performs classification of new observations, based on classifying rules considered in the “clover_calc” function. For selected types of classifiers, vectors of class labels are produced and returned in a list. This enables the user to use the classifiers directly or combine them in a particular manner (using e.g. the majority vote approach).

Z: matrix or numeric vector, an nZ x d matrix of d-dimensional observations with unknown class labels, or a numeric vector of length d. Examples are provided here.

clover: list, an output of the function “clover_calc”.

classifiers: character vector, which types of classifiers should be reported? Examples are provided here.
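
The majority vote mentioned above could be sketched as follows. The list labels stands in for the output of “clover_classify”; its structure, the element names and the use of NA for unclassified observations are all assumptions made for this example.

```r
# Hypothetical labels from three classifiers for five new observations
# (element names are illustrative; NA = no label assigned)
labels <- list(
  clf_a = c(1, 2, 2, NA, 1),
  clf_b = c(1, 2, 1, 2, NA),
  clf_c = c(2, 2, 1, NA, 2)
)

votes <- do.call(cbind, labels)  # one row per observation, one column per classifier

# Class with more votes wins; missing votes are ignored, ties stay unclassified
majority_vote <- function(v) {
  n1 <- sum(v == 1, na.rm = TRUE)
  n2 <- sum(v == 2, na.rm = TRUE)
  if (n1 > n2) 1 else if (n2 > n1) 2 else NA_real_
}

combined <- apply(votes, 1, majority_vote)
combined  # 1 2 1 2 NA
```

Leaving ties unclassified keeps the combined rule from favouring either class arbitrarily; other tie-breaking choices are equally possible.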

Further examples

Examples with distributions having heavy tails are given here.

Examples with unbalanced data or unequal variances are given here.