# The multivariate bias-variance decomposition

The bias-variance decomposition splits the mean-squared error of an estimator into two parts: the variance, stemming from the fluctuations of the prediction, and the bias, measuring the deviation of the expected prediction from the true value. While it is a very straightforward relation, it offers fundamental insight into the tradeoffs necessary when constructing estimators. In this post I show how to derive the bias-variance decomposition in the multivariate setting, motivated by a use case in quantum metrology, and talk a bit about Stein’s paradox.

## The univariate bias-variance decomposition

Consider an estimator $\hat{\varphi}$ of a scalar quantity $\phi$. The variance of $\hat{\varphi}$ is defined as the expected squared deviation from its expectation value $$\operatorname{Var}(\hat{\varphi}) := \mathbb{E}\{ (\hat{\varphi} - \mathbb{E}\{\hat{\varphi}\})^2 \}.$$ The bias is defined as the deviation of its expectation value from the true value that we want to estimate $$\operatorname{Bias}(\hat{\varphi}) := \mathbb{E}\{ \hat{\varphi} - \phi \}.$$ A reasonable measure for estimator performance is the mean-squared error (MSE) $$\operatorname{MSE}(\hat{\varphi}) := \mathbb{E}\{ (\hat{\varphi} - \phi)^2 \}.$$ The bias-variance decomposition manifests itself in a simple equation relating the aforementioned quantities: $$\operatorname{MSE}(\hat{\varphi}) = \operatorname{Var}(\hat{\varphi}) + \operatorname{Bias}(\hat{\varphi})^2.$$ The implications of this decomposition for learning scenarios (where it works analogously) are really well summarized in this infographic.
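As a quick numerical sanity check, here is a minimal sketch (all numbers made up: a toy parameter and a deliberately biased, shrunk sample mean) that verifies the decomposition empirically:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup: estimate phi = 2.0 from 10 noisy samples per trial,
# using a deliberately biased estimator (a shrunk sample mean).
phi = 2.0
data = rng.normal(loc=phi, scale=1.0, size=(100_000, 10))
estimates = 0.8 * data.mean(axis=1)  # shrinking by 0.8 introduces bias

mse = np.mean((estimates - phi) ** 2)
var = np.var(estimates)
bias = np.mean(estimates) - phi

# The empirical MSE matches Var + Bias^2 up to floating-point error
print(mse, var + bias ** 2)
```

Note that the identity holds exactly for the empirical distribution, not just in expectation, which makes it a convenient unit test for estimator code.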

## The multivariate case

The multivariate generalization of this decomposition is well known. You can read about it in your favourite textbook or in this blog post 1. Nevertheless I will re-derive it here in a slightly more general way, which I will motivate shortly. But first we need to update our definitions to reflect the fact that both our estimator $\hat{\boldsymbol{\varphi}}$ and our ground truth $\boldsymbol{\phi}$ are now vectors. The bias is just the vector of deviations of the individual components $$\operatorname{Bias}(\hat{\boldsymbol{\varphi}}) := \mathbb{E}\{ \hat{\boldsymbol{\varphi}} - \boldsymbol{\phi} \}.$$ The generalization of the variance is a bit more involved, as we have to analyze not only the variances of the individual components of $\hat{\boldsymbol{\varphi}}$ but also the correlations between them. This is captured by the covariance matrix $$\operatorname{Cov}(\hat{\boldsymbol{\varphi}}) := \mathbb{E} \{ (\hat{\boldsymbol{\varphi}} - \mathbb{E}\{ \hat{\boldsymbol{\varphi}}\})(\hat{\boldsymbol{\varphi}} - \mathbb{E}\{ \hat{\boldsymbol{\varphi}}\})^{T}\}.$$
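These definitions translate directly into code. A minimal sketch with a hypothetical estimator (the mean of 20 draws plus a fixed offset playing the role of a bias; all numbers made up):

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy estimator: the mean of 20 draws from a multivariate normal,
# plus a fixed offset that plays the role of a bias.
phi = np.array([1.0, -0.5, 2.0])
offset = np.array([0.1, 0.0, -0.2])
draws = rng.multivariate_normal(phi, np.eye(3), size=(50_000, 20))
estimates = draws.mean(axis=1) + offset  # one estimate per trial, shape (50_000, 3)

bias = estimates.mean(axis=0) - phi    # vector of componentwise deviations
cov = np.cov(estimates, rowvar=False)  # empirical 3x3 covariance matrix
print(bias)  # approximately the offset
print(cov)   # approximately eye(3) / 20
```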

### Some metrological motivation

Before we come to the multivariate bias-variance decomposition we make a short detour to motivate it. In multi-parameter quantum metrology, one is interested in the precise measurement of physical quantities – which corresponds to the construction of an estimator of said quantities with minimal mean-squared error. Luckily for quantum metrology researchers, there exists a fundamental bound on the attainable precision for unbiased estimators that we can exploit here. The Cramér-Rao bound links the minimal attainable covariance to the (classical) Fisher Information Matrix $I_{\boldsymbol{\phi}}$ that quantifies how strongly the measurement outcomes of a quantum state change with the underlying physical quantities $\boldsymbol{\phi}$: $$\operatorname{Cov}(\hat{\boldsymbol{\varphi}}) \geq I_{\boldsymbol{\phi}}^{-1}.$$ Here, the matrix inequality $A \geq B$ means that $A - B$ is positive semi-definite. The Cramér-Rao bound can be attained in the limit of infinite samples by maximum likelihood estimation; we therefore don’t need to worry about the estimator itself and can focus on the Fisher Information Matrix alone.
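Checking a matrix inequality like this amounts to verifying that the difference is positive semi-definite, e.g. via its eigenvalues. A small sketch with hypothetical placeholder matrices:

```python
import numpy as np

# Checking Cov >= I^{-1} means checking that Cov - I^{-1} is positive
# semi-definite. Both matrices below are hypothetical placeholders.
cov = np.array([[1.2, 0.1],
                [0.1, 0.9]])
fisher = np.array([[1.0, 0.0],
                   [0.0, 2.0]])

gap = cov - np.linalg.inv(fisher)
smallest_eigenvalue = np.linalg.eigvalsh(gap).min()
print(smallest_eigenvalue >= 0)  # True iff the Cramér-Rao bound holds here
```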

In our recent work 2, we used methods from quantum machine learning to optimize quantum sensing protocols by minimizing this bound. But to apply these methods, the bound needed to be reduced to a scalar quantity. A straightforward way to do so is to trace over both sides of the Cramér-Rao bound to get the scalar inequality $$\operatorname{Tr}\{ \operatorname{Cov}(\hat{\boldsymbol{\varphi}})\} \geq \operatorname{Tr}\{ I_{\boldsymbol{\phi}}^{-1}\}.$$ While this works, it feels like we are throwing away information, as the trace only depends on the diagonal elements of the matrices. Therefore, one usually considers a *weighted* trace. If we first multiply both sides with a positive semi-definite weighting matrix $W$ and then perform the trace, we get $$\operatorname{Tr}\{ W \operatorname{Cov}(\hat{\boldsymbol{\varphi}})\} \geq \operatorname{Tr}\{W I_{\boldsymbol{\phi}}^{-1}\}.$$ This gives us greater freedom to select which parameters we deem “important” and lets us focus on selected linear combinations of the parameters. While this approach brings great flexibility, adding the weighting matrix always felt ad-hoc to me. But there is actually a neat way of justifying it – via a generalized multivariate bias-variance decomposition!
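To see how the weighting matrix singles out linear combinations, note that for the rank-one choice $W = \boldsymbol{w}\boldsymbol{w}^{T}$ the weighted trace reduces to the variance of $\boldsymbol{w}^{T}\hat{\boldsymbol{\varphi}}$, since $\operatorname{Tr}\{\boldsymbol{w}\boldsymbol{w}^{T}\operatorname{Cov}\} = \boldsymbol{w}^{T}\operatorname{Cov}\,\boldsymbol{w}$. A small sketch with made-up weights and covariance:

```python
import numpy as np

rng = np.random.default_rng(2)

# For the rank-one choice W = w w^T, the weighted trace Tr{W Cov} equals
# the variance of the linear combination w^T phi_hat.
w = np.array([1.0, -1.0, 0.5])  # made-up weights on the parameters
W = np.outer(w, w)              # positive semi-definite by construction

true_cov = np.array([[1.0, 0.3, 0.0],
                     [0.3, 2.0, 0.0],
                     [0.0, 0.0, 0.5]])
samples = rng.multivariate_normal(np.zeros(3), true_cov, size=100_000)

weighted_trace = np.trace(W @ np.cov(samples, rowvar=False))
variance_of_combination = np.var(samples @ w)
print(weighted_trace, variance_of_combination)  # agree up to sampling details
```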

## A generalized multivariate bias-variance decomposition

We consider the inner product generated by the positive semi-definite weighting matrix $W$ $$\langle \boldsymbol{x}, \boldsymbol{y} \rangle_W := \boldsymbol{x}^{T} W \boldsymbol{y}.$$ This inner product induces an associated seminorm $$\lVert \boldsymbol{x} \rVert_W^2 := \langle \boldsymbol{x}, \boldsymbol{x} \rangle_W = \boldsymbol{x}^{T}W\boldsymbol{x}.$$ We can now consider the mean-squared error in the seminorm induced by $W$ $$\operatorname{MSE}_W(\hat{\boldsymbol{\varphi}}) := \mathbb{E}\{ \lVert \hat{\boldsymbol{\varphi}} - \boldsymbol{\phi} \rVert_W^2 \}$$ which has the following bias-variance decomposition that I prove at the end of this post: $$\operatorname{MSE}_W(\hat{\boldsymbol{\varphi}}) = \operatorname{Tr}\{W\operatorname{Cov}(\hat{\boldsymbol{\varphi}}) \} + \lVert \operatorname{Bias}(\hat{\boldsymbol{\varphi}}) \rVert_W^2.$$ The bias-variance decomposition for the standard Euclidean norm is obtained by setting $W = \mathbb{I}$.

This means that the scalar obtained from the weighted trace of the covariance matrix of an unbiased estimator is actually equal to the mean-squared error in the seminorm induced by $W$! In this reading, the weighting matrix is not an ad-hoc construction but mirrors the fact that the mean-squared error is simply computed from a different seminorm.
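A Monte Carlo sketch (with hypothetical ground truth, bias, and weighting matrix, all numbers made up) that checks the generalized decomposition numerically:

```python
import numpy as np

rng = np.random.default_rng(3)

# Monte Carlo check of MSE_W = Tr{W Cov} + ||Bias||_W^2.
phi = np.array([1.0, 2.0])
true_bias = np.array([0.3, -0.1])
W = np.array([[2.0, 0.5],
              [0.5, 1.0]])  # positive definite weighting matrix

estimates = rng.multivariate_normal(phi + true_bias, 0.2 * np.eye(2), size=200_000)

# Left-hand side: the weighted mean-squared error, averaged over trials
diff = estimates - phi
mse_w = np.mean(np.einsum("ni,ij,nj->n", diff, W, diff))

# Right-hand side: weighted trace of the covariance plus the weighted bias norm
bias = estimates.mean(axis=0) - phi
cov = np.cov(estimates, rowvar=False)
decomposed = np.trace(W @ cov) + bias @ W @ bias
print(mse_w, decomposed)  # equal up to floating-point and ddof effects
```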

We can prove the equality by some algebraic manipulations: \begin{align*} \operatorname{MSE}_W(\hat{\boldsymbol{\varphi}}) &= \mathbb{E}\{ \lVert \hat{\boldsymbol{\varphi}} - \boldsymbol{\phi} \rVert_W^2 \} \\ &= \mathbb{E}\{ (\hat{\boldsymbol{\varphi}} - \boldsymbol{\phi})^{T}W(\hat{\boldsymbol{\varphi}} - \boldsymbol{\phi}) \}\\ &= \mathbb{E}\{ \operatorname{Tr}\{ W(\hat{\boldsymbol{\varphi}} - \boldsymbol{\phi})(\hat{\boldsymbol{\varphi}} - \boldsymbol{\phi})^{T} \} \}\\ &= \operatorname{Tr}\{ W \mathbb{E}\{ (\hat{\boldsymbol{\varphi}} - \boldsymbol{\phi})(\hat{\boldsymbol{\varphi}} - \boldsymbol{\phi})^{T} \} \}\\ &= \operatorname{Tr}\{ W \mathbb{E}\{ (\hat{\boldsymbol{\varphi}} - \mathbb{E}\{\hat{\boldsymbol{\varphi}}\} + \operatorname{Bias}(\hat{\boldsymbol{\varphi}}))(\hat{\boldsymbol{\varphi}} - \mathbb{E}\{\hat{\boldsymbol{\varphi}}\} + \operatorname{Bias}(\hat{\boldsymbol{\varphi}}))^{T} \} \}\\ &= \operatorname{Tr}\{ W\operatorname{Cov}(\hat{\boldsymbol{\varphi}}) + W(\operatorname{Bias}(\hat{\boldsymbol{\varphi}}))(\operatorname{Bias}(\hat{\boldsymbol{\varphi}}))^{T} \\ &\hphantom{\operatorname{Tr}\{ }+ W\mathbb{E}\{ (\hat{\boldsymbol{\varphi}} - \mathbb{E}\{\hat{\boldsymbol{\varphi}}\})\}(\operatorname{Bias}(\hat{\boldsymbol{\varphi}}))^{T}\\ &\hphantom{\operatorname{Tr}\{ }+W(\operatorname{Bias}(\hat{\boldsymbol{\varphi}}))\mathbb{E}\{ (\hat{\boldsymbol{\varphi}} - \mathbb{E}\{\hat{\boldsymbol{\varphi}}\})^{T} \} \} \\ &=\operatorname{Tr}\{ W\operatorname{Cov}(\hat{\boldsymbol{\varphi}}) + W(\operatorname{Bias}(\hat{\boldsymbol{\varphi}}))(\operatorname{Bias}(\hat{\boldsymbol{\varphi}}))^{T} \} \\ &= \operatorname{Tr}\{ W\operatorname{Cov}(\hat{\boldsymbol{\varphi}}) \} + \lVert \operatorname{Bias}(\hat{\boldsymbol{\varphi}}) \rVert^2_W. \end{align*} In the fourth equality, we exploited the linearity of the trace to commute it with the expectation value. 
In the fifth equality, we used the identity $\boldsymbol{\phi} = \mathbb{E}\{ \hat{\boldsymbol{\varphi}}\}-\operatorname{Bias}(\hat{\boldsymbol{\varphi}})$. In the seventh equality we used the fact that $\mathbb{E}\{ \hat{\boldsymbol{\varphi}} - \mathbb{E}\{ \hat{\boldsymbol{\varphi}}\}\} = 0$.

One cannot talk about the bias-variance decomposition and the related tradeoff without mentioning a particularly interesting instance of it, namely Stein’s paradox. Assume that $\boldsymbol{\phi}$ is the mean vector of a normal distribution with at least three components that are independent of each other. As an example, consider the three components to be the length of sea bass living in Lake Tahoe (are there sea bass in Lake Tahoe?), the height of buildings in Mexico City and the floor space of Berlin apartments. We would now like to estimate the mean vector of this distribution. In the usual formulation we are moreover asked to do so from seeing only one sample $\boldsymbol{x}$ of the distribution.
One would normally expect that estimating the mean as $\hat{\boldsymbol{\mu}}=\boldsymbol{x}$ is the best approach, as this is also the maximum-likelihood estimate. But this is actually only true if the dimension of the data is less than $3$! “Shrinking” the estimator towards an arbitrary point will always result in an estimator that is both biased and correlated but still achieves a lower mean-squared error. This effect is paradoxical because we assumed that the components of our distribution are independent, and yet there is a correlated estimator with superior performance.
A putative explanation of this paradox is as follows: shrinking towards the true mean decreases the mean-squared error. But if we take the limit of the James-Stein estimator as the shrinkage target goes to infinity, we recover the estimator $\hat{\boldsymbol{\mu}}=\boldsymbol{x}$. So shrinking towards any finite point necessarily improves our estimate, as any point is better than infinity.
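A short simulation makes the paradox tangible. The sketch below compares the mean-squared error of the maximum-likelihood estimator with the plain (not positive-part) James-Stein estimator shrinking towards the origin, for a made-up three-dimensional mean and unit-variance noise:

```python
import numpy as np

rng = np.random.default_rng(4)

# Stein's paradox in simulation: for d >= 3 and a single observation
# x ~ N(mu, I), the plain James-Stein estimator
#   mu_JS = (1 - (d - 2) / ||x||^2) x
# dominates the maximum-likelihood estimator mu_MLE = x.
d = 3
mu = np.array([1.0, -2.0, 0.5])  # made-up true mean
x = rng.normal(mu, 1.0, size=(200_000, d))  # one observation per trial

shrinkage = 1.0 - (d - 2) / np.sum(x ** 2, axis=1, keepdims=True)
js = shrinkage * x

mse_mle = np.mean(np.sum((x - mu) ** 2, axis=1))
mse_js = np.mean(np.sum((js - mu) ** 2, axis=1))
print(mse_mle, mse_js)  # the James-Stein risk is strictly smaller
```

The gap widens as the true mean moves closer to the shrinkage target, and shrinks (but never closes) as it moves away, which matches the heuristic explanation above.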