The noisy parameter-shift rule

The parameter-shift rule is a workhorse for the optimization of parametrized quantum circuits. In this post, I explain how it also works in a noisy context and how we can extend it to convex combinations of quantum channels. This blog post is a summary of Appendix A of our paper A variational toolbox for quantum multi-parameter estimation 1.

Variational Quantum Algorithms are candidates for a successful application of Noisy Intermediate-Scale Quantum devices. The weights of a parametrized quantum circuit (PQC) are tuned to minimize a cost function encoding the problem in question, e.g. a prediction error for a learning task. Harrow and Napp showed 2 that gradient information provably improves such optimizations and, luckily for us, gradient information can be obtained on quantum hardware! We could always use finite differences to compute the gradients, but the nature of the quantum gates used to construct parametrized quantum circuits enables a more elaborate strategy, namely the parameter-shift rule.


Usually, cost functions are phrased in terms of expectation values of some observables $\{ O_j \}$, evaluated on a parametrized quantum state $\rho$, $$ \langle O \rangle = \operatorname{Tr}(\rho O). $$ In the case that $\rho$ depends on a parameter $\mu$ that parametrizes a Pauli-rotation gate $$ U(\mu) = e^{-i \mu P / 2}, $$ we can compute the derivative with respect to $\mu$ as $$ \partial_{\mu} \langle O \rangle(\mu) = \frac{1}{2}\left(\langle O \rangle\left(\mu + \frac{\pi}{2}\right) - \langle O \rangle\left(\mu - \frac{\pi}{2}\right) \right). $$
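As a quick sanity check of this formula, here is a minimal NumPy sketch. It assumes a single qubit starting in $|0\rangle$, an $X$ rotation, and the observable $Z$, for which $\langle Z \rangle(\mu) = \cos\mu$. The function names `rotation`, `expectation` and `parameter_shift` are my own choices for illustration and do not come from any library.

```python
import numpy as np

# Pauli matrices used in this sketch
X = np.array([[0, 1], [1, 0]], dtype=complex)
Z = np.array([[1, 0], [0, -1]], dtype=complex)
I2 = np.eye(2, dtype=complex)

def rotation(mu, pauli):
    """U(mu) = exp(-i mu P / 2) for a Pauli operator P (uses P @ P = identity)."""
    return np.cos(mu / 2) * I2 - 1j * np.sin(mu / 2) * pauli

def expectation(mu, rho0, pauli, observable):
    """<O>(mu) = Tr(O U(mu) rho0 U(mu)^dagger)."""
    u = rotation(mu, pauli)
    return np.real(np.trace(observable @ u @ rho0 @ u.conj().T))

def parameter_shift(mu, rho0, pauli, observable):
    """Derivative of <O>(mu) from two evaluations shifted by +pi/2 and -pi/2."""
    plus = expectation(mu + np.pi / 2, rho0, pauli, observable)
    minus = expectation(mu - np.pi / 2, rho0, pauli, observable)
    return 0.5 * (plus - minus)

rho0 = np.array([[1, 0], [0, 0]], dtype=complex)  # the state |0><0|
mu = 0.7
shift_grad = parameter_shift(mu, rho0, X, Z)
exact_grad = -np.sin(mu)  # because <Z>(mu) = cos(mu) for an X rotation of |0>
print(shift_grad, exact_grad)
```

The two printed numbers should agree up to floating-point precision, since the rule is exact and not a finite-difference approximation.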

To my knowledge, the parameter-shift rule first arose in the context of quantum optimal control in work by Li et al. 3 and was then adapted for quantum circuits by Mitarai et al. 4. Schuld et al. 5 give a good overview of the topic and provide new parameter-shift rules for Gaussian gates used in continuous-variable applications. Recently, Banchi and Crooks 6 proposed a stochastic generalization of the parameter-shift rule which expands its applicability significantly.

Noisy parameter-shift rule

The proofs in the references I cited use the pure-state formalism. But quantum computers are inherently noisy, so it is quite natural to ask if the parameter-shift rule still holds in that case. To show under which conditions this is true, we first have to change our picture from unitaries to quantum channels. A quantum channel is the most general process admissible in quantum mechanics 7 and formalizes unitary evolutions, measurements, noise channels and so on. Basically any process mapping a valid density matrix to a valid density matrix can be expressed as a quantum channel.

In the picture of quantum channels, the requirement for the existence of a parameter-shift rule is that the channel in question can express its own derivative as a linear combination: $$ \partial_{\mu} \mathcal{N}(\mu) = \sum_j c_j \mathcal{N}(f_j(\mu)). $$ By linearity of the trace, we then have $$\begin{align*} \partial_{\mu} \langle O \rangle(\mu) & = \partial_{\mu} \operatorname{Tr}(O \mathcal{N}(\mu)[\rho]) \\
& = \operatorname{Tr}(O \partial_{\mu} \mathcal{N}(\mu)[\rho]) \\
& = \operatorname{Tr}(O \sum_j c_j \mathcal{N}(f_j(\mu))[\rho]) \\
& = \sum_j c_j \operatorname{Tr}(O \mathcal{N}(f_j(\mu))[\rho]) \\
& = \sum_j c_j \langle O \rangle(f_j(\mu)) \end{align*} $$

It turns out that any quantum channel that can express its own derivative still admits a parameter-shift rule if it is preceded and followed by noise channels that are independent of the channel's parameter. We can thus write our noisy gate model as a concatenation of the channel itself with a noise channel $\mathcal{B}$ before and $\mathcal{A}$ after: $$ \mathcal{V}(\mu) = \mathcal{A} \circ \mathcal{N}(\mu) \circ \mathcal{B}. $$ We can show that this channel admits the same parameter-shift rule as $\mathcal{N}$, $$\begin{align*} \partial_{\mu} \langle O \rangle(\mu) & = \partial_{\mu} \operatorname{Tr}(O \mathcal{V}(\mu)[\rho]) \\
& = \partial_{\mu} \operatorname{Tr}(O (\mathcal{A} \circ \mathcal{N}(\mu) \circ \mathcal{B})[\rho]) \\
& = \partial_{\mu} \operatorname{Tr}(\mathcal{A}^{\dagger}[O] \mathcal{N}(\mu)[\mathcal{B}[\rho]]) \\
& = \sum_j c_j \operatorname{Tr}(\mathcal{A}^{\dagger}[O] \mathcal{N}(f_j(\mu))[\mathcal{B}[\rho]]) \\
& = \sum_j c_j \operatorname{Tr}(O (\mathcal{A} \circ \mathcal{N}(f_j(\mu)) \circ \mathcal{B})[\rho]) \\
& = \sum_j c_j \operatorname{Tr}(O \mathcal{V}(f_j(\mu))[\rho]) \\
& = \sum_j c_j \langle O \rangle(f_j(\mu)), \end{align*} $$ where we exploited the linearity of the trace and used the adjoint channel of $\mathcal{A}$ to temporarily shift the noise to the observable. The same argument holds if the sum is an integral, which means that it also applies to the techniques developed in Ref. 6.
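To see the argument at work numerically, here is a small NumPy sketch. It assumes a $Y$ rotation as the gate $\mathcal{N}(\mu)$, dephasing as $\mathcal{B}$ and depolarizing noise as $\mathcal{A}$. The noise strengths, the input state and all function names are arbitrary choices for illustration. The $\pm\pi/2$ shift rule is compared against a central finite difference.

```python
import numpy as np

X = np.array([[0, 1], [1, 0]], dtype=complex)
Y = np.array([[0, -1j], [1j, 0]], dtype=complex)
Z = np.array([[1, 0], [0, -1]], dtype=complex)
I2 = np.eye(2, dtype=complex)

def gate(mu, rho):
    """N(mu): unitary channel of the rotation exp(-i mu Y / 2)."""
    u = np.cos(mu / 2) * I2 - 1j * np.sin(mu / 2) * Y
    return u @ rho @ u.conj().T

def dephasing(p, rho):
    """B: parameter-independent dephasing noise applied before the gate."""
    return (1 - p) * rho + p * (Z @ rho @ Z)

def depolarizing(p, rho):
    """A: parameter-independent depolarizing noise applied after the gate."""
    return (1 - p) * rho + (p / 3) * (X @ rho @ X + Y @ rho @ Y + Z @ rho @ Z)

def noisy_expectation(mu, rho, observable):
    """<O>(mu) = Tr(O (A o N(mu) o B)[rho]) with fixed noise strengths."""
    out = depolarizing(0.1, gate(mu, dephasing(0.2, rho)))
    return np.real(np.trace(observable @ out))

rho = 0.5 * (I2 + 0.6 * X + 0.3 * Z)  # a valid mixed input state
mu, eps = 0.4, 1e-6

# Same shift rule as for the noiseless Pauli rotation
shift_grad = 0.5 * (noisy_expectation(mu + np.pi / 2, rho, X)
                    - noisy_expectation(mu - np.pi / 2, rho, X))
# Central finite difference as an independent reference
fd_grad = (noisy_expectation(mu + eps, rho, X)
           - noisy_expectation(mu - eps, rho, X)) / (2 * eps)
print(shift_grad, fd_grad)
```

The noise changes the value of the gradient compared to the noiseless circuit, but the shift rule used to compute it stays the same.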


We will now discuss what kinds of gates can be modeled in this way. First and foremost, we need to take care of unitary evolutions: It was proven in Ref. 5 that a parameter-shift rule exists for all unitary quantum gates generated by an operator $G$ that has only two distinct eigenvalues, which most prominently includes the Pauli operators and their tensor products: $$ \mathcal{G}(\mu)[\rho] = e^{- i \mu G} \rho e^{i \mu G}. $$ It is straightforward to show that the results of Ref. 5 also imply that the above channel can express its own derivative; you can find the argument in Appendix A of Ref. 1. If $r$ is half the absolute difference of the two distinct eigenvalues (so that the eigenvalues are $\pm r$ up to a common shift), then the derivative is $$ \partial_{\mu} \mathcal{G}(\mu) = r \left[ \mathcal{G}\left(\mu + \frac{\pi}{4r}\right) - \mathcal{G}\left(\mu - \frac{\pi}{4r}\right) \right]. $$ For $G = P/2$ with a Pauli operator $P$, we have $r = 1/2$ and recover the shift of $\pi/2$ and the prefactor $1/2$ of the rule stated at the beginning of this post.
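The channel-level identity can be checked directly on density matrices. The following NumPy sketch assumes a two-qubit generator $G = r \, Z \otimes X$ with $r = 1.5$, so that the two eigenvalues of $G$ are $\pm r$. The generator, the input state and the function names are assumptions made for this illustration only.

```python
import numpy as np

X = np.array([[0, 1], [1, 0]], dtype=complex)
Z = np.array([[1, 0], [0, -1]], dtype=complex)

r = 1.5
G = r * np.kron(Z, X)  # Hermitian generator with eigenvalues +r and -r

def channel(mu, rho):
    """G(mu)[rho] = exp(-i mu G) rho exp(+i mu G), via the eigendecomposition of G."""
    evals, evecs = np.linalg.eigh(G)
    u = evecs @ np.diag(np.exp(-1j * mu * evals)) @ evecs.conj().T
    return u @ rho @ u.conj().T

def shifted_derivative(mu, rho):
    """r * (G(mu + pi/(4r)) - G(mu - pi/(4r))) applied to rho."""
    s = np.pi / (4 * r)
    return r * (channel(mu + s, rho) - channel(mu - s, rho))

# Input state |+>|0>, which is not an eigenstate of G
plus = np.array([1, 1]) / np.sqrt(2)
zero = np.array([1, 0])
psi = np.kron(plus, zero)
rho = np.outer(psi, psi.conj()).astype(complex)

mu, eps = 0.3, 1e-6
# Central finite difference of the whole output density matrix
fd = (channel(mu + eps, rho) - channel(mu - eps, rho)) / (2 * eps)
max_deviation = np.max(np.abs(shifted_derivative(mu, rho) - fd))
print(max_deviation)
```

Because the comparison is done entry by entry on the output state, it holds for every observable at once, which is exactly what "the channel expresses its own derivative" means.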

We can combine the unitary evolution with a subsequent dephasing, depolarizing or loss channel to get a model of our noisy gate. Another possibility is the modeling of control noise: it is impossible to perform a unitary $\mathcal{G}(\mu)$ to arbitrary precision in $\mu$. We can model this as follows: assume we try to implement $\mathcal{G}(\mu)$, but the gate that is actually performed is $\mathcal{G}(\mu + \chi)$, where $\chi$ is distributed according to a probability distribution $p(\chi)$, e.g. a zero-centered normal distribution. Due to the additivity of time evolutions, we can write $\mathcal{G}(\mu + \chi) = \mathcal{G}(\chi) \circ \mathcal{G}(\mu)$. Averaging over the probability distribution $p(\chi)$ yields the channel $$ \mathcal{N}[\rho] = \int_{-\pi}^{\pi} \mathrm{d} \chi \, p(\chi) \mathcal{G}(\chi)[\rho] $$ that models the control noise of our gate. The noisy gate model is then $$ \mathcal{V}(\mu) = \mathcal{N} \circ \mathcal{G}(\mu), $$ which is differentiable via the same parameter-shift rule as $\mathcal{G}$ itself.
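The control-noise model can be sketched in a few lines of NumPy. I assume here an $X$ rotation $e^{-i\mu X/2}$, a zero-centered Gaussian $p(\chi)$ with standard deviation `sigma = 0.2`, and a simple discretization of the integral on a grid of 401 points. All of these, as well as the function names, are illustrative assumptions.

```python
import numpy as np

X = np.array([[0, 1], [1, 0]], dtype=complex)
Z = np.array([[1, 0], [0, -1]], dtype=complex)
I2 = np.eye(2, dtype=complex)

def gate(mu, rho):
    """G(mu): unitary channel of the rotation exp(-i mu X / 2)."""
    u = np.cos(mu / 2) * I2 - 1j * np.sin(mu / 2) * X
    return u @ rho @ u.conj().T

# Discretized Gaussian p(chi) on [-pi, pi], normalized so the weights sum to one
sigma = 0.2
chis = np.linspace(-np.pi, np.pi, 401)
weights = np.exp(-chis**2 / (2 * sigma**2))
weights /= weights.sum()

def control_noise(rho):
    """N[rho]: weighted average of G(chi)[rho] over the angle error chi."""
    return sum(w * gate(chi, rho) for w, chi in zip(weights, chis))

def noisy_expectation(mu, rho, observable):
    """<O>(mu) = Tr(O (N o G(mu))[rho])."""
    return np.real(np.trace(observable @ control_noise(gate(mu, rho))))

rho0 = np.array([[1, 0], [0, 0]], dtype=complex)  # the state |0><0|
mu, eps = 0.7, 1e-6

shift_grad = 0.5 * (noisy_expectation(mu + np.pi / 2, rho0, Z)
                    - noisy_expectation(mu - np.pi / 2, rho0, Z))
fd_grad = (noisy_expectation(mu + eps, rho0, Z)
           - noisy_expectation(mu - eps, rho0, Z)) / (2 * eps)
# For a narrow Gaussian, averaging cos(mu + chi) damps the signal by exp(-sigma^2 / 2)
damped_grad = -np.exp(-sigma**2 / 2) * np.sin(mu)
print(shift_grad, fd_grad, damped_grad)
```

The angle jitter only rescales the cosine landscape in this example, which is why the unchanged $\pm\pi/2$ rule still returns the derivative of the noisy expectation value.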

Convex combinations of quantum channels

We will now move on to show that parameter-shift rules are not limited to unitary evolutions alone. Convex combinations of channels also admit a parameter-shift rule that can be exploited in simulation and emulation contexts. Let’s consider the channel $$ \mathcal{N}(p)[\rho] = (1-p) \mathcal{N}_1[\rho] + p \mathcal{N}_2[\rho], $$ which interpolates between $\mathcal{N}_1$ and $\mathcal{N}_2$, two arbitrary quantum channels of compatible dimensions. This model includes many well-known channels found in the quantum information literature, like the dephasing channel $$ \mathcal{N}(p)[\rho] = (1-p) \rho + p Z \rho Z, $$ the depolarizing channel $$ \mathcal{N}(p)[\rho] = (1-p) \rho + p \left(\frac{1}{3} X \rho X + \frac{1}{3} Y \rho Y + \frac{1}{3} Z \rho Z\right), $$ which is (up to a rescaling of $p$) a special case of a replacement channel $$ \mathcal{N}(p)[\rho] = (1-p) \rho + p \varepsilon, $$ where $\varepsilon$ is the replacement state.

One often encounters these channels in a form where the probability is expressed in terms of an error rate $\alpha$ as $$ p(t) = \frac{1 - e^{-\alpha t}}{2}. $$ The factor $1/2$ accounts for the fact that (usually) the maximum noise corresponds to an equal mixing between both channels.

We will now prove that interpolated channels of this form allow for the following parameter-shift rule: $$ \partial_p \mathcal{N}(p)[\rho] = \frac{1}{q_1 - q_2}\left( \mathcal{N}(q_1)[\rho] - \mathcal{N}(q_2)[\rho] \right) $$ for all distinct $q_1, q_2 \in [0, 1]$.

To prove this, we rephrase the application of $\mathcal{N}$ in vector notation: $$ \mathcal{N}(p)[\rho] = \begin{pmatrix} 1-p \\p \end{pmatrix} \begin{pmatrix} \mathcal{N}_1[\rho] \\ \mathcal{N}_2[\rho] \end{pmatrix}, $$ where the implicit product of vectors is the scalar product. The derivative is given by $$ \partial_p \mathcal{N}(p)[\rho] = \begin{pmatrix} -1 \\ 1 \end{pmatrix} \begin{pmatrix} \mathcal{N}_1[\rho] \\ \mathcal{N}_2[\rho] \end{pmatrix}. $$ Now, assume we have access to two independent realizations of the channel at $q_1$ and $q_2$. We then get the vector of outcomes $$ \begin{pmatrix} \mathcal{N}(q_1)[\rho] \\ \mathcal{N}(q_2)[\rho]\end{pmatrix} = \begin{pmatrix} 1-q_1 & q_1 \\ 1-q_2 & q_2 \end{pmatrix}\begin{pmatrix} \mathcal{N}_1[\rho] \\ \mathcal{N}_2[\rho] \end{pmatrix}. $$ If $q_1 \neq q_2$, the matrix on the right is invertible, which allows us to compute the derivative as $$\begin{align*} \partial_p \mathcal{N}(p)[\rho] & = \begin{pmatrix} -1 \\ 1 \end{pmatrix} \begin{pmatrix} \mathcal{N}_1[\rho] \\ \mathcal{N}_2[\rho] \end{pmatrix} \\
& =\begin{pmatrix} -1 \\ 1 \end{pmatrix}\begin{pmatrix} 1-q_1 & q_1 \\ 1-q_2 & q_2 \end{pmatrix}^{-1}\begin{pmatrix} \mathcal{N}(q_1)[\rho] \\ \mathcal{N}(q_2)[\rho]\end{pmatrix} \\
& = \frac{1}{q_1 - q_2}\left(\mathcal{N}(q_1)[\rho] - \mathcal{N}(q_2)[\rho]\right). \end{align*} $$ The proof technique extends analogously to channels that are convex combinations of more than two channels.
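The linear-algebra step of this proof is easy to reproduce numerically. The sketch below assumes the dephasing channel as the example, for which $\partial_p \mathcal{N}(p)[\rho] = Z \rho Z - \rho$, and picks $q_1 = 0.9$, $q_2 = 0.2$ arbitrarily. The helper `shift_coefficients` is a name I made up. It solves the transposed linear system instead of forming the matrix inverse explicitly.

```python
import numpy as np

Z = np.array([[1, 0], [0, -1]], dtype=complex)

def dephasing(p, rho):
    """N(p)[rho] = (1 - p) rho + p Z rho Z."""
    return (1 - p) * rho + p * (Z @ rho @ Z)

def shift_coefficients(q1, q2):
    """Coefficients (c1, c2) with (-1, 1) M^{-1} = (c1, c2) for the mixing matrix M."""
    m = np.array([[1 - q1, q1],
                  [1 - q2, q2]])
    # (-1, 1) M^{-1} = c^T is equivalent to M^T c = (-1, 1)^T
    return np.linalg.solve(m.T, np.array([-1.0, 1.0]))

rho = 0.5 * np.array([[1, 1], [1, 1]], dtype=complex)  # the state |+><+|
q1, q2 = 0.9, 0.2

c1, c2 = shift_coefficients(q1, q2)
shift_derivative = c1 * dephasing(q1, rho) + c2 * dephasing(q2, rho)
exact_derivative = Z @ rho @ Z - rho  # independent of p for a convex combination
print(c1, c2)
```

The same `solve` call carries over to convex combinations of more than two channels if the mixing matrix and the derivative vector are enlarged accordingly.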

If we have $q_1 = \frac{1}{2}(1 - e^{-\alpha t_1})$ and $q_2 = \frac{1}{2}(1 - e^{- \alpha t_2})$, we can apply the chain rule to get the alternative formulation $$ \partial_t \mathcal{N}(t)[\rho] = \frac{\partial p}{\partial t} \partial_p \mathcal{N}(p(t))[\rho] = \frac{\alpha e^{-\alpha t}}{e^{- \alpha t_2} - e^{- \alpha t_1}}\left( \mathcal{N}(t_1)[\rho] - \mathcal{N}(t_2)[\rho] \right) $$ that works for any distinct $t_1, t_2 \in [0, \infty)$.
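Here is a short NumPy sketch of the time-parametrized version, again assuming the dephasing channel and arbitrary values for $\alpha$, $t$, $t_1$ and $t_2$. The names `p_of_t`, `channel_t` and `time_shift_rule` are hypothetical helpers for this illustration.

```python
import numpy as np

Z = np.array([[1, 0], [0, -1]], dtype=complex)
alpha = 0.8  # error rate, chosen arbitrarily for this sketch

def p_of_t(t):
    """p(t) = (1 - exp(-alpha t)) / 2."""
    return 0.5 * (1 - np.exp(-alpha * t))

def channel_t(t, rho):
    """Dephasing channel N(p(t))[rho]."""
    p = p_of_t(t)
    return (1 - p) * rho + p * (Z @ rho @ Z)

def time_shift_rule(t, t1, t2, rho):
    """Derivative with respect to t from channel evaluations at t1 and t2."""
    prefactor = alpha * np.exp(-alpha * t) / (np.exp(-alpha * t2) - np.exp(-alpha * t1))
    return prefactor * (channel_t(t1, rho) - channel_t(t2, rho))

rho = 0.5 * np.array([[1, 1], [1, 1]], dtype=complex)  # the state |+><+|
t, t1, t2 = 1.0, 0.3, 2.5

shift_derivative = time_shift_rule(t, t1, t2, rho)
# Chain rule by hand: dp/dt = alpha exp(-alpha t) / 2 and d/dp N(p)[rho] = Z rho Z - rho
exact_derivative = 0.5 * alpha * np.exp(-alpha * t) * (Z @ rho @ Z - rho)
print(np.max(np.abs(shift_derivative - exact_derivative)))
```

Note that $t_1$ and $t_2$ do not have to be close to $t$: any two distinct times give the exact derivative.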


We have shown that, thanks to the linearity of the trace, the parameter-shift rule still holds under a noise model where perfect unitaries are preceded and followed by arbitrary parameter-independent noise channels. We furthermore learned that parameter-shift rules aren’t even limited to unitaries alone, but that a large class of quantum channels also admits a parameter-shift rule!

If you want to explore this further, the folks at Xanadu provided a PennyLane demonstration showcasing the optimization of noisy quantum circuits with the parameter-shift rule. Please check it out!

  1. Meyer, J. J., Borregaard, J., & Eisert, J. (2020). A variational toolbox for quantum multi-parameter estimation. arXiv preprint arXiv:2006.06303.

  2. Harrow, A., & Napp, J. (2019). Low-depth gradient measurements can improve convergence in variational hybrid quantum-classical algorithms. arXiv preprint arXiv:1901.05374.

  3. Li, J., Yang, X., Peng, X., & Sun, C. P. (2017). Hybrid quantum-classical approach to quantum optimal control. Physical Review Letters, 118(15), 150503. arXiv preprint arXiv:1608.00677.

  4. Mitarai, K., Negoro, M., Kitagawa, M., & Fujii, K. (2018). Quantum circuit learning. Physical Review A, 98(3), 032309. arXiv preprint arXiv:1803.00745.

  5. Schuld, M., Bergholm, V., Gogolin, C., Izaac, J., & Killoran, N. (2019). Evaluating analytic gradients on quantum hardware. Physical Review A, 99(3), 032331. arXiv preprint arXiv:1811.11184.

  6. Banchi, L., & Crooks, G. E. (2020). Measuring Analytic Gradients of General Quantum Evolution with the Stochastic Parameter Shift Rule. arXiv preprint arXiv:2005.10299.

  7. Wilde, M. M. (2011). From classical to quantum Shannon theory. arXiv preprint arXiv:1106.1445.