the multivariate bias-variance decomposition.
The bias-variance decomposition splits the mean-squared error of an estimator into two contributions: the variance, stemming from the fluctuations of the prediction, and the bias, measuring the deviation of the expected value of the prediction from the true value. While it is a very straightforward relation, it gives fundamental insight into the tradeoffs necessary when constructing estimators. In this post I show how to derive the bias-variance decomposition in the multivariate setting, motivated by a use case in quantum metrology, and talk a bit about Stein’s paradox.
The univariate bias-variance decomposition#
Consider an estimator $\hat{\varphi}$ of a scalar quantity $\phi$. The variance of $\hat{\varphi}$ is defined as the expected squared deviation from its expectation value $$ \operatorname{Var}(\hat{\varphi}) := \mathbb{E}\{ (\hat{\varphi} - \mathbb{E}\{\hat{\varphi}\})^2 \}. $$ The bias is defined as the deviation of its expectation value from the true value that we want to estimate $$ \operatorname{Bias}(\hat{\varphi}) := \mathbb{E}\{ \hat{\varphi} - \phi \}. $$ A reasonable measure for estimator performance is the mean-squared error (MSE) $$ \operatorname{MSE}(\hat{\varphi}) := \mathbb{E}\{ (\hat{\varphi} - \phi)^2 \}. $$ The bias-variance decomposition manifests itself in a simple equation relating the aforementioned quantities: $$ \operatorname{MSE}(\hat{\varphi}) = \operatorname{Var}(\hat{\varphi}) + \operatorname{Bias}(\hat{\varphi})^2. $$ The implications of this decomposition for learning scenarios (where it works analogously) are really well summarized in this infographic.
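As a quick sanity check, the decomposition can also be verified numerically. The following sketch (using numpy; the shrunk-mean estimator is just a made-up example, not from any particular application) simulates many estimation runs of a deliberately biased estimator and compares the empirical MSE with the sum of empirical variance and squared bias:

```python
import numpy as np

rng = np.random.default_rng(0)
phi = 2.0  # true value of the scalar quantity

# A deliberately biased estimator: the sample mean shrunk towards zero.
def estimate(samples):
    return 0.8 * samples.mean()

# Simulate many independent estimation runs.
estimates = np.array(
    [estimate(rng.normal(phi, 1.0, size=10)) for _ in range(200_000)]
)

variance = estimates.var()          # Var(phi_hat)
bias = estimates.mean() - phi       # Bias(phi_hat)
mse = ((estimates - phi) ** 2).mean()

# The empirical MSE matches Var + Bias^2 (exactly, for empirical moments).
print(mse, variance + bias**2)
```

Note that for empirical moments the identity even holds exactly, not just up to Monte Carlo error, since it is an algebraic property of the sample.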
The multivariate case#
The multivariate generalization of this decomposition is well known. You can read about it in your favourite textbook or in this blog post ^{1}. Nevertheless I will rederive it here in a slightly more general way, which I am going to motivate shortly. But first we need to update our definitions to reflect the fact that both our estimator $\hat{\boldsymbol{\varphi}}$ and our ground truth $\boldsymbol{\phi}$ are now vectors. The bias is just the vector of deviations of the individual components $$ \operatorname{Bias}(\hat{\boldsymbol{\varphi}}) := \mathbb{E}\{ \hat{\boldsymbol{\varphi}} - \boldsymbol{\phi} \}. $$ The generalisation of the variance is a bit more complicated, as we have to analyze not only the variances of the individual components of $\hat{\boldsymbol{\varphi}}$ but also the correlations between them. This is captured by the covariance matrix $$ \operatorname{Cov}(\hat{\boldsymbol{\varphi}}) := \mathbb{E} \{ (\hat{\boldsymbol{\varphi}} - \mathbb{E}\{ \hat{\boldsymbol{\varphi}}\})(\hat{\boldsymbol{\varphi}} - \mathbb{E}\{ \hat{\boldsymbol{\varphi}}\})^{T}\}. $$
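For concreteness, the covariance matrix can be estimated from repeated runs of a vector-valued estimator by averaging outer products of the centered realizations. A small numpy sketch (the distribution parameters below are arbitrary illustration values):

```python
import numpy as np

rng = np.random.default_rng(1)

# Many realizations of a 3-component estimator (one row per run).
runs = rng.multivariate_normal(
    mean=[0.0, 1.0, 2.0],
    cov=[[1.0, 0.3, 0.0],
         [0.3, 2.0, 0.0],
         [0.0, 0.0, 0.5]],
    size=100_000,
)

# E{(x - E x)(x - E x)^T}, estimated by averaging outer products.
centered = runs - runs.mean(axis=0)
cov = centered.T @ centered / len(runs)

# Agrees with numpy's built-in estimator (with the same ddof convention).
print(np.allclose(cov, np.cov(runs, rowvar=False, ddof=0)))
```

The off-diagonal entries capture exactly the correlations between components that the scalar variance cannot express.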
Some metrological motivation#
Before we come to the multivariate bias-variance decomposition we make a short detour to motivate it. In multiparameter quantum metrology, one is interested in the precise measurement of physical quantities – which corresponds to the construction of an estimator of said quantities with minimal mean-squared error. Luckily for quantum metrology researchers, there exists a fundamental bound on the attainable precision for unbiased estimators that we can exploit here. The Cramér-Rao bound links the minimal attainable covariance to the (classical) Fisher Information Matrix $I_{\boldsymbol{\phi}}$ that quantifies how much the measurement outcomes of a quantum state change with the underlying physical quantities $\boldsymbol{\phi}$: $$ \operatorname{Cov}(\hat{\boldsymbol{\varphi}}) \geq I_{\boldsymbol{\phi}}^{-1}. $$ As a matrix inequality, $A \geq B$ means that $A - B$ is a positive semidefinite matrix. The Cramér-Rao bound can be attained in the limit of infinite samples by maximum likelihood estimation, so we don’t need to worry about the estimator itself and can focus solely on the Fisher Information Matrix.
In our recent work ^{2}, we used methods from quantum machine learning to optimize quantum sensing protocols by minimizing this bound. But to apply these methods, the bound needed to be reduced to a scalar quantity. A straightforward way to do so is to take the trace of both sides of the Cramér-Rao bound to get the scalar inequality $$ \operatorname{Tr}\{ \operatorname{Cov}(\hat{\boldsymbol{\varphi}})\} \geq \operatorname{Tr}\{ I_{\boldsymbol{\phi}}^{-1}\}. $$ While this works, it feels like we are throwing away information, as the trace only depends on the diagonal elements of the matrices. Therefore, one usually considers a *weighted* trace. If we first multiply with a positive semidefinite weighting matrix $W$ and then perform the trace, we get $$ \operatorname{Tr}\{ W \operatorname{Cov}(\hat{\boldsymbol{\varphi}})\} \geq \operatorname{Tr}\{W I_{\boldsymbol{\phi}}^{-1}\}. $$ This gives us greater freedom to select which parameters we deem “important” and allows us to focus on selected linear combinations of the parameters. While this approach brings great flexibility, adding the weighting matrix always felt ad hoc to me. But there is actually a neat way of justifying it – via a generalized multivariate bias-variance decomposition!
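To make the effect of the weighting matrix concrete, here is a small sketch with a made-up $2 \times 2$ Fisher Information Matrix; all numbers are purely illustrative:

```python
import numpy as np

# Toy 2-parameter Fisher Information Matrix (hypothetical values).
I_phi = np.array([[4.0, 1.0],
                  [1.0, 2.0]])

# Unweighted scalar bound: Tr{I^{-1}}.
bound_plain = np.trace(np.linalg.inv(I_phi))

# Weighting the first parameter 10x more heavily changes the bound
# we would optimize against.
W = np.diag([10.0, 1.0])
bound_weighted = np.trace(W @ np.linalg.inv(I_phi))

print(bound_plain, bound_weighted)  # 6/7 vs. 24/7
```

Off-diagonal entries of $W$ would additionally emphasize specific linear combinations of the parameters rather than individual ones.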
A generalized multivariate bias-variance decomposition#
We consider the inner product generated by the positive semidefinite weighting matrix $W$ $$ \langle \boldsymbol{x}, \boldsymbol{y} \rangle_W := \boldsymbol{x}^{T} W \boldsymbol{y}. $$ This inner product induces an associated seminorm $$ \lVert \boldsymbol{x} \rVert_W^2 := \langle \boldsymbol{x}, \boldsymbol{x} \rangle_W = \boldsymbol{x}^{T}W\boldsymbol{x}. $$ We can now consider the mean-squared error in the seminorm induced by $W$ $$ \operatorname{MSE}_W(\hat{\boldsymbol{\varphi}}) := \mathbb{E}\{ \lVert \hat{\boldsymbol{\varphi}} - \boldsymbol{\phi} \rVert_W^2 \} $$ which has the following bias-variance decomposition that I prove at the end of this post: $$ \operatorname{MSE}_W(\hat{\boldsymbol{\varphi}}) = \operatorname{Tr}\{W\operatorname{Cov}(\hat{\boldsymbol{\varphi}}) \} + \lVert \operatorname{Bias}(\hat{\boldsymbol{\varphi}}) \rVert_W^2. $$ The bias-variance decomposition for the standard Euclidean norm is obtained by setting $W = \mathbb{I}$.
This means that the scalar obtained from the weighted trace of the covariance matrix of an unbiased estimator is actually equal to the mean-squared error in the seminorm induced by $W$! In this reading, the weighting matrix is not an ad hoc construction but reflects the fact that the mean-squared error is simply computed in a different seminorm.
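The weighted decomposition can be checked numerically before proving it. The sketch below (ground truth, weighting matrix, and the shrunk estimator are arbitrary illustration choices) computes the empirical weighted MSE and compares it to the weighted trace of the covariance plus the weighted squared bias:

```python
import numpy as np

rng = np.random.default_rng(2)
phi = np.array([1.0, -0.5, 2.0])          # ground truth vector
W = np.array([[2.0, 0.5, 0.0],
              [0.5, 1.0, 0.0],
              [0.0, 0.0, 3.0]])            # positive semidefinite weights

# A biased estimator: a noisy observation shrunk towards zero.
estimates = 0.9 * (phi + rng.normal(0.0, 1.0, size=(100_000, 3)))

# Empirical MSE in the W-seminorm: mean of e^T W e over all runs.
errors = estimates - phi
mse_w = np.einsum('ni,ij,nj->n', errors, W, errors).mean()

# Empirical covariance and bias.
centered = estimates - estimates.mean(axis=0)
cov = centered.T @ centered / len(estimates)
bias = estimates.mean(axis=0) - phi

# Tr{W Cov} + ||Bias||_W^2 should reproduce the weighted MSE.
decomposed = np.trace(W @ cov) + bias @ W @ bias
print(mse_w, decomposed)
```

As in the univariate case, the identity holds exactly for empirical moments, so the two printed numbers agree to floating-point precision.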
We can prove the equality by some algebraic manipulations:
$$\begin{align*}
\operatorname{MSE}_W(\hat{\boldsymbol{\varphi}}) &= \mathbb{E}\{ \lVert \hat{\boldsymbol{\varphi}} - \boldsymbol{\phi} \rVert_W^2 \} \\
&= \mathbb{E}\{ (\hat{\boldsymbol{\varphi}} - \boldsymbol{\phi})^{T}W(\hat{\boldsymbol{\varphi}} - \boldsymbol{\phi}) \}\\
&= \mathbb{E}\{ \operatorname{Tr}\{ W(\hat{\boldsymbol{\varphi}} - \boldsymbol{\phi})(\hat{\boldsymbol{\varphi}} - \boldsymbol{\phi})^{T} \} \}\\
&= \operatorname{Tr}\{ W \mathbb{E}\{ (\hat{\boldsymbol{\varphi}} - \boldsymbol{\phi})(\hat{\boldsymbol{\varphi}} - \boldsymbol{\phi})^{T} \} \}\\
&= \operatorname{Tr}\{ W \mathbb{E}\{ (\hat{\boldsymbol{\varphi}} - \mathbb{E}\{\hat{\boldsymbol{\varphi}}\} + \operatorname{Bias}(\hat{\boldsymbol{\varphi}}))(\hat{\boldsymbol{\varphi}} - \mathbb{E}\{\hat{\boldsymbol{\varphi}}\} + \operatorname{Bias}(\hat{\boldsymbol{\varphi}}))^{T} \} \}\\
&= \operatorname{Tr}\{ W\operatorname{Cov}(\hat{\boldsymbol{\varphi}}) + W(\operatorname{Bias}(\hat{\boldsymbol{\varphi}}))(\operatorname{Bias}(\hat{\boldsymbol{\varphi}}))^{T} \\
&\hphantom{\operatorname{Tr}\{ }+ W\mathbb{E}\{ (\hat{\boldsymbol{\varphi}} - \mathbb{E}\{\hat{\boldsymbol{\varphi}}\})\}(\operatorname{Bias}(\hat{\boldsymbol{\varphi}}))^{T}\\
&\hphantom{\operatorname{Tr}\{ }+W(\operatorname{Bias}(\hat{\boldsymbol{\varphi}}))\mathbb{E}\{ (\hat{\boldsymbol{\varphi}} - \mathbb{E}\{\hat{\boldsymbol{\varphi}}\})^{T} \} \} \\
&=\operatorname{Tr}\{ W\operatorname{Cov}(\hat{\boldsymbol{\varphi}}) + W(\operatorname{Bias}(\hat{\boldsymbol{\varphi}}))(\operatorname{Bias}(\hat{\boldsymbol{\varphi}}))^{T} \} \\
&= \operatorname{Tr}\{ W\operatorname{Cov}(\hat{\boldsymbol{\varphi}}) \} + \lVert \operatorname{Bias}(\hat{\boldsymbol{\varphi}}) \rVert^2_W.
\end{align*}$$
In the fourth equality, we exploited the linearity of the trace to commute it with the expectation value. In the fifth equality, we used the identity $\boldsymbol{\phi} = \mathbb{E}\{ \hat{\boldsymbol{\varphi}}\} - \operatorname{Bias}(\hat{\boldsymbol{\varphi}})$. In the seventh equality we used the fact that $\mathbb{E}\{ \hat{\boldsymbol{\varphi}} - \mathbb{E}\{ \hat{\boldsymbol{\varphi}}\}\} = 0$, which makes the two cross terms vanish.
Stein’s paradox#
One cannot talk about the bias-variance decomposition and the related tradeoff without mentioning a particularly interesting example of the bias-variance tradeoff, namely Stein’s paradox. Assume that $\boldsymbol{\phi}$ is a vector with at least three components that are independent of each other and normally distributed. As an example, let the three components be the length of sea bass living in Lake Tahoe (are there sea bass at Lake Tahoe?), the height of buildings in Mexico City and the floor space of Berlin apartments. We would now like to estimate the mean vector of this distribution. In the usual formulation we are also requested to do so from seeing only one sample $\boldsymbol{x}$ of the distribution.
One would normally expect that estimating the mean as $\hat{\boldsymbol{\mu}}=\boldsymbol{x}$ is the best approach, as this is also the maximum-likelihood estimate. But this is actually only true if the dimension of the data is less than $3$! “Shrinking” the estimator towards an arbitrary point yields an estimator that is both biased and correlated but still achieves lower mean-squared error. This effect seems paradoxical because we assumed that the components of our distribution are independent, and yet there is a correlated estimator with superior performance.
A putative explanation of this paradox is as follows: we get a decreased mean-squared error when we shrink towards the true mean. But if we take the limit of the James-Stein estimator as the shrinkage target goes to infinity, we recover the estimator $\hat{\boldsymbol{\mu}}=\boldsymbol{x}$. So shrinking towards any finite point necessarily improves our estimate, as any point is better than infinity.
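A quick simulation illustrates the paradox (a minimal sketch; the mean vector and the choice to shrink towards the origin are arbitrary). Each trial draws a single sample of a three-dimensional Gaussian and compares the maximum-likelihood estimate with the James-Stein estimate:

```python
import numpy as np

rng = np.random.default_rng(3)
d = 3
mu = np.array([1.0, 2.0, 3.0])   # hypothetical true means
sigma2 = 1.0                      # known noise variance

# One sample per trial, many trials (one row per trial).
x = rng.normal(mu, np.sqrt(sigma2), size=(200_000, d))

# MLE: just report the observed sample.
mse_mle = ((x - mu) ** 2).sum(axis=1).mean()

# James-Stein: shrink the sample towards the origin.
norms2 = (x ** 2).sum(axis=1)
js = (1.0 - (d - 2) * sigma2 / norms2)[:, None] * x
mse_js = ((js - mu) ** 2).sum(axis=1).mean()

print(mse_js < mse_mle)  # shrinkage wins despite being biased
```

Repeating the experiment with $d = 1$ or $d = 2$ (and the shrinkage factor adjusted accordingly) would not show this improvement, consistent with the dimension threshold above.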

Chris Yeh (2019). Bias-Variance Decomposition of Mean Squared Error. ↩︎

Meyer, J. J., Borregaard, J., & Eisert, J. (2020). A variational toolbox for quantum multiparameter estimation. arXiv preprint arXiv:2006.06303. ↩︎