netcal.metrics.QCE

class netcal.metrics.QCE(bins: int = 10, *, marginal: bool = True, sample_threshold: int = 1)

Marginal Quantile Calibration Error (M-QCE) and Conditional Quantile Calibration Error (C-QCE). Both metrics measure the gap between predicted quantile levels and observed quantile coverage, and both are also applicable to multivariate distributions. The M-QCE and C-QCE were originally proposed in [1]. Both metrics are based on the Normalized Estimation Error Squared (NEES) known from object tracking [2]; their derivation is given in the following.

Definition of standard NEES: Given mean prediction \(\hat{\boldsymbol{y}} \in \mathbb{R}^M\), ground-truth \(\boldsymbol{y} \in \mathbb{R}^M\), and estimated covariance matrix \(\hat{\boldsymbol{\Sigma}} \in \mathbb{R}^{M \times M}\) using \(M\) dimensions, the NEES is defined as

\[\epsilon = (\boldsymbol{y} - \hat{\boldsymbol{y}})^\top \hat{\boldsymbol{\Sigma}}^{-1} (\boldsymbol{y} - \hat{\boldsymbol{y}}) .\]
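The NEES definition above can be sketched in NumPy as follows. This is a minimal illustration, not netcal's internal implementation; the function name `nees` and the array shapes are assumptions made for this example.

```python
import numpy as np

def nees(y_true, y_pred, cov):
    """Per-sample NEES: (y - y_hat)^T Sigma^{-1} (y - y_hat).

    y_true, y_pred: arrays of shape (N, M); cov: array of shape (N, M, M).
    (Hypothetical helper, shapes chosen for this sketch.)
    """
    diff = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    # Solve Sigma x = diff for every sample instead of forming the explicit inverse.
    solved = np.linalg.solve(np.asarray(cov, dtype=float), diff[..., None])[..., 0]
    # Row-wise dot product diff^T x gives one NEES value per sample.
    return np.einsum("nm,nm->n", diff, solved)

y_true = np.array([[1.0, 2.0], [0.0, 0.0]])
y_pred = np.array([[0.0, 0.0], [0.0, 0.0]])
cov = np.stack([np.diag([1.0, 4.0]), np.eye(2)])

eps = nees(y_true, y_pred, cov)
print(eps)  # first sample: 1/1 + 4/4 = 2, second sample: 0
```

Solving the linear system rather than inverting the covariance is a numerical-stability choice of this sketch and does not change the value of \(\epsilon\).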

The average NEES is defined as the mean error over \(N\) trials in a Monte-Carlo simulation for Kalman-Filter testing, so that

\[\bar{\epsilon} = \frac{1}{N} \sum^N_{i=1} \epsilon_i .\]

Under the condition that \(\mathbb{E}[\boldsymbol{y} - \hat{\boldsymbol{y}}] = \boldsymbol{0}\) (zero mean), a \(\chi^2\)-test is performed to evaluate the estimated uncertainty. The test is accepted if

\[\bar{\epsilon} \leq \chi^2_M(\tau),\]

where \(\chi^2_M(\tau)\) is the value of the percent point function (PPF, i.e. the inverse CDF) of a \(\chi^2\) distribution with \(M\) degrees of freedom, evaluated at a quantile level \(\tau \in [0, 1]\).
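As a hedged illustration of the acceptance test, the threshold \(\chi^2_M(\tau)\) can be obtained with `scipy.stats.chi2.ppf`. The helper name `average_nees_test` and the synthetic data (residuals with identity covariance) are assumptions for this sketch only.

```python
import numpy as np
from scipy.stats import chi2

def average_nees_test(eps, n_dims, tau=0.95):
    """Compare the average NEES against the chi^2 PPF at level tau (hypothetical helper)."""
    eps_bar = float(np.mean(eps))
    threshold = float(chi2.ppf(tau, df=n_dims))
    return eps_bar, threshold, eps_bar <= threshold

rng = np.random.default_rng(0)
M = 2
# Synthetic zero-mean residuals whose true covariance is the identity matrix.
errors = rng.standard_normal((1000, M))
# With Sigma = I the NEES reduces to the squared Euclidean norm of the residual.
eps = np.sum(errors ** 2, axis=1)

eps_bar, threshold, accepted = average_nees_test(eps, M)
print(eps_bar, threshold, accepted)
```

For correctly specified uncertainty the NEES follows a \(\chi^2\) distribution with \(M\) degrees of freedom, so its average is close to \(M\) and stays below the threshold.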

Marginal Quantile Calibration Error (M-QCE): In the case of regression calibration testing, we are interested in the gap between predicted quantile levels and observed quantile coverage probability for a certain set of quantile levels. We assume \(N\) observations from our test set that are used to estimate the NEES, so that we can compute the expected deviation between predicted quantile level and observed quantile coverage by

\[\text{M-QCE}(\tau) := \mathbb{E} \Big[ \big| \mathbb{P} \big( \epsilon \leq \chi^2_M(\tau) \big) - \tau \big| \Big] ,\]

which is the definition of the Marginal Quantile Calibration Error (M-QCE) [1]. The M-QCE is calculated by

\[\text{M-QCE}(\tau) = \Bigg| \frac{1}{N} \sum^N_{n=1} \mathbb{1} \big( \epsilon_n \leq \chi^2_M(\tau) \big) - \tau \Bigg|\]
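The empirical M-QCE estimate above amounts to a single coverage frequency compared against \(\tau\). A minimal sketch, assuming the NEES values are already available as a one-dimensional array (the function name `m_qce` is hypothetical and not part of netcal):

```python
import numpy as np
from scipy.stats import chi2

def m_qce(eps, n_dims, tau):
    """|observed coverage - tau| for NEES values eps (hypothetical helper)."""
    threshold = chi2.ppf(tau, df=n_dims)
    # Fraction of samples whose NEES lies inside the tau-quantile region.
    coverage = np.mean(np.asarray(eps, dtype=float) <= threshold)
    return float(abs(coverage - tau))

eps = np.array([0.1, 0.5, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 9.0])

# For M = 2 and tau = 0.5 the threshold is -2 ln(0.5), roughly 1.386,
# so 3 of the 10 samples are covered and the error is |0.3 - 0.5| = 0.2.
result = m_qce(eps, n_dims=2, tau=0.5)
print(result)
```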

Conditional Quantile Calibration Error (C-QCE): The M-QCE measures the marginal calibration error, which is more suitable for testing quantile calibration. However, similar to netcal.metrics.regression.UCE and netcal.metrics.regression.ENCE, we want to induce a dependency on the estimated covariance, since we require that

\[ \begin{align}\begin{aligned}&\mathbb{E}[(\boldsymbol{y} - \hat{\boldsymbol{y}})(\boldsymbol{y} - \hat{\boldsymbol{y}})^\top | \hat{\boldsymbol{\Sigma}} = \boldsymbol{\Sigma}] = \boldsymbol{\Sigma},\\&\forall \boldsymbol{\Sigma} \in \mathbb{R}^{M \times M}, \boldsymbol{\Sigma} \succcurlyeq 0, \boldsymbol{\Sigma}^\top = \boldsymbol{\Sigma} .\end{aligned}\end{align} \]

To estimate a covariance-dependent QCE, we apply a binning scheme (similar to the netcal.metrics.confidence.ECE) over the square root of the standardized generalized variance (SGV) [3], which is defined as

\[\sigma_G = \sqrt{\text{det}(\hat{\boldsymbol{\Sigma}})^{\frac{1}{M}}} .\]
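A short sketch of how \(\sigma_G\) could be computed for a batch of covariance matrices. The helper name `generalized_std` is an assumption, and the code presumes positive definite covariance matrices; it uses the log-determinant, which is equivalent to the formula above but less prone to overflow in higher dimensions.

```python
import numpy as np

def generalized_std(cov):
    """sigma_G = sqrt(det(Sigma)^(1/M)) for cov of shape (..., M, M) (hypothetical helper).

    Assumes every covariance matrix is positive definite.
    """
    cov = np.asarray(cov, dtype=float)
    n_dims = cov.shape[-1]
    # log det(Sigma); the sign is +1 for positive definite matrices and is ignored here.
    _, logdet = np.linalg.slogdet(cov)
    # sqrt(det^(1/M)) = exp(logdet / (2 M))
    return np.exp(logdet / (2.0 * n_dims))

cov = np.stack([np.diag([4.0, 4.0]), np.diag([1.0, 4.0])])
sigma_g = generalized_std(cov)
print(sigma_g)  # 2.0 for the first matrix, sqrt(2) for the second
```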

Using the generalized standard deviation, it is possible to get a summarized statistic across different combinations of correlations to denote the distribution’s dispersion. Thus, the Conditional Quantile Calibration Error (C-QCE) [1] is defined by

\[\text{C-QCE}(\tau) := \mathbb{E}_{\sigma_G, X}\Big[\Big|\mathbb{P}\big(\epsilon \leq \chi^2_M(\tau) | \sigma_G\big) - \tau \Big|\Big] .\]

To approximate the expectation over the generalized standard deviation, we use a binning scheme with \(B\) bins (similar to the ECE) and \(N_b\) samples per bin to compute the weighted sum across all bins, so that

\[\text{C-QCE}(\tau) \approx \sum^B_{b=1} \frac{N_b}{N} | \text{freq}(b) - \tau |\]

where \(\text{freq}(b)\) is the coverage frequency within bin \(b\) and given by

\[\text{freq}(b) = \frac{1}{N_b} \sum_{n \in \mathcal{M}_b} \mathbb{1}\big(\epsilon_n \leq \chi^2_M(\tau)\big) ,\]

with \(\mathcal{M}_b\) as the set of indices within bin \(b\).
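The binned approximation can be sketched as below. This is an illustration under stated assumptions rather than netcal's implementation: the function name `c_qce` is hypothetical, the bins are taken as equal-width intervals over the observed range of \(\sigma_G\), and empty bins are simply skipped.

```python
import numpy as np
from scipy.stats import chi2

def c_qce(eps, sigma_g, n_dims, tau, bins=10):
    """Weighted sum over bins of |freq(b) - tau| (hypothetical helper)."""
    eps = np.asarray(eps, dtype=float)
    sigma_g = np.asarray(sigma_g, dtype=float)
    covered = eps <= chi2.ppf(tau, df=n_dims)
    # Assumption: equal-width bin edges over the observed range of sigma_G.
    edges = np.linspace(sigma_g.min(), sigma_g.max(), bins + 1)
    # Digitize against the interior edges only, so the maximum falls into the last bin.
    bin_idx = np.digitize(sigma_g, edges[1:-1])
    n_total = eps.shape[0]
    error = 0.0
    for b in range(bins):
        in_bin = bin_idx == b
        n_b = int(in_bin.sum())
        if n_b == 0:
            continue  # empty bins contribute nothing to the sum
        freq_b = covered[in_bin].mean()
        error += (n_b / n_total) * abs(freq_b - tau)
    return float(error)

# Low-dispersion samples are always covered, high-dispersion samples never are.
sigma_g = np.array([1.0, 1.0, 1.0, 1.0, 3.0, 3.0, 3.0, 3.0])
eps = np.array([0.1, 0.1, 0.1, 0.1, 100.0, 100.0, 100.0, 100.0])

conditional = c_qce(eps, sigma_g, n_dims=2, tau=0.5, bins=2)
marginal = float(abs(np.mean(eps <= chi2.ppf(0.5, df=2)) - 0.5))
print(conditional, marginal)  # 0.5 and 0.0
```

In this toy data set the overall coverage matches \(\tau = 0.5\), so the marginal error is zero, while each bin deviates by 0.5. This is the kind of miscalibration that only the conditional metric reveals.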

Parameters:

bins (int or iterable, default: 10) – Number of bins used for the internal binning. If iterable, use a different number of bins for each dimension (nx1, nx2, … = bins).
marginal (bool, optional, default: True) – If True, compute the M-QCE. This is the marginal probability over all samples falling into the desired quantiles. If False, use the C-QCE. The C-QCE uses a binning scheme over the generalized standard deviation to measure the conditional probability over all samples falling into the desired quantiles w.r.t. the generalized standard deviation.
sample_threshold (int, optional, default: 1) – Bins with fewer samples than this threshold are excluded from the miscalibration metrics.

References

Methods

`__init__`([bins, marginal, sample_threshold])	Constructor.
`binning`(bin_bounds, samples, *values[, nan])	Perform binning on value (and all additional values passed) based on samples.
`frequency`(X, y[, batched, uncertainty])	Measure the frequency of each point by binning.
`measure`(X, y, q, *[, kind, reduction, ...])	Measure quantile loss for given input data either as tuple consisting of mean and stddev estimates or as NumPy array consisting of a sample distribution.
`prepare`(X, y[, batched, uncertainty])	Check input data.
`process`(metric, acc_hist, conf_hist, ...)	Determine miscalibration based on passed histograms.
`reduce`(histogram, distribution, axis[, ...])	Calculate the weighted mean on a given histogram based on a dedicated data distribution.