Lightning-GPU device

The lightning.gpu device is an extension of PennyLane’s built-in lightning.qubit device. It extends the CPU-focused Lightning simulator to run using the NVIDIA cuQuantum SDK, enabling GPU-accelerated simulation of quantum state-vector evolution.

A lightning.gpu device can be loaded using:

import pennylane as qml
dev = qml.device("lightning.gpu", wires=2)

If the NVIDIA cuQuantum libraries are available, the above device will allow all operations to be performed on a CUDA-capable GPU of generation SM 7.0 (Volta) or greater. If the libraries are not correctly installed or not available on the path, the device will fall back to lightning.qubit and perform all simulation on the CPU.

The lightning.gpu device also directly supports quantum circuit gradients using the adjoint differentiation method. This can be enabled at the PennyLane QNode level with:

@qml.qnode(dev, diff_method="adjoint")
def circuit(params):
    qml.RX(params[0], wires=0)
    return qml.expval(qml.PauliZ(0))
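
Gradients of the QNode can then be requested in the usual way; a minimal sketch, assuming the device and circuit above (the parameter value is illustrative):

from pennylane import numpy as np

params = np.array([0.1], requires_grad=True)
grads = qml.grad(circuit)(params)  # adjoint differentiation performed on the GPU state vector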

Supported operations and observables

Supported operations:

BasisState: Prepares a single computational basis state.
CNOT: The controlled-NOT operator.
CRot: The controlled-Rot operator.
CRX: The controlled-RX operator.
CRY: The controlled-RY operator.
CRZ: The controlled-RZ operator.
Hadamard: The Hadamard operator.
PauliX: The Pauli X operator.
PauliY: The Pauli Y operator.
PauliZ: The Pauli Z operator.
PhaseShift: Arbitrary single qubit local phase shift.
ControlledPhaseShift: A qubit controlled phase shift.
QubitStateVector: Prepare subsystems using the given ket vector in the computational basis.
Rot: Arbitrary single qubit rotation.
RX: The single qubit X rotation.
RY: The single qubit Y rotation.
RZ: The single qubit Z rotation.
S: The single-qubit phase gate.
T: The single-qubit T gate.

Supported observables:

Hadamard: The Hadamard operator.
Identity: The identity observable \(I\).
PauliX: The Pauli X operator.
PauliY: The Pauli Y operator.
PauliZ: The Pauli Z operator.
Hamiltonian: Operator representing a Hamiltonian.
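
Putting a few of these together, the following sketch runs a small circuit built from supported gates and measures the expectation value of a Hamiltonian observable (the coefficients and wiring are illustrative only):

import pennylane as qml

dev = qml.device("lightning.gpu", wires=2)
H = qml.Hamiltonian([0.5, 0.3], [qml.PauliZ(0), qml.PauliX(0) @ qml.PauliX(1)])

@qml.qnode(dev, diff_method="adjoint")
def circuit(params):
    qml.RX(params[0], wires=0)
    qml.CNOT(wires=[0, 1])
    qml.RY(params[1], wires=1)
    return qml.expval(H)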

Parallel adjoint differentiation support:

The lightning.gpu device directly supports the adjoint differentiation method and enables parallelization over the requested observables. It also allows direct control of observable batching, which can be used to run concurrent calculations across multiple available GPUs.

If you are computing a large number of expectation values, or if you are using a large number of wires on your device, it may be best to evenly divide the number of expectation value calculations across all available GPUs. This will reduce the overall memory cost of the observables per GPU, at the cost of additional compute time. Assuming m observables and n GPUs, the default behaviour is to pre-allocate all storage for the m observables on a single GPU. To divide the workload amongst many GPUs, initialize a lightning.gpu device with the batch_obs=True keyword argument, as:

import pennylane as qml
dev = qml.device("lightning.gpu", wires=20, batch_obs=True)

With the above, each GPU will see at most m/n observables to process, reducing the preallocated memory footprint.
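
As an illustration, a workload with many expectation values, where this batching applies, might look like the following sketch (the circuit and observables are illustrative only):

import pennylane as qml
from pennylane import numpy as np

dev = qml.device("lightning.gpu", wires=20, batch_obs=True)

@qml.qnode(dev, diff_method="adjoint")
def circuit(params):
    for w in range(20):
        qml.RX(params[w], wires=w)
    # m = 20 observables below; with n GPUs, each GPU handles at most m/n of them
    # during the adjoint gradient computation.
    return [qml.expval(qml.PauliZ(w)) for w in range(20)]

def cost(params):
    # Stack the expectation values into a single array for differentiation.
    return qml.math.stack(circuit(params))

params = np.array([0.1] * 20, requires_grad=True)
jac = qml.jacobian(cost)(params)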

Additionally, there can be situations where, even with the above distribution and limited GPU memory, the overall problem does not fit on the requested GPU devices. You can further reduce the concurrent allocations on the available GPUs by providing an integer value to the batch_obs keyword. For example, to batch-evaluate observables with at most one observable allocation per GPU, define the device as:

import pennylane as qml
dev = qml.device("lightning.gpu", wires=27, batch_obs=1)

Each problem is unique, so it is often best to start with the default behaviour and tune with the above options only if necessary.