# Lightning-GPU device¶

The lightning.gpu device is an extension of PennyLane’s built-in lightning.qubit device. It extends the CPU-focused Lightning simulator to run using the NVIDIA cuQuantum SDK, enabling GPU-accelerated simulation of quantum state-vector evolution.

A lightning.gpu device can be loaded using:

import pennylane as qml
dev = qml.device("lightning.gpu", wires=2)


If the NVIDIA cuQuantum libraries are available, the above device will allow all operations to be perfomed on a CUDA capable GPU of generation SM 7.0 (Volta) and greater. If the libraries are not correctly installed, or available on path, the device will fall-back to lightning.qubit and perform all simulation on the CPU.

The lightning.gpu device also directly supports quantum circuit gradients using the adjoint differentiation method. This can be enabled at the PennyLane QNode level with:

qml.qnode(dev, diff_method="adjoint")
def circuit(params):
...


## Supported operations and observables¶

Supported operations:

 BasisState Prepares a single computational basis state. CNOT The controlled-NOT operator CRot The controlled-Rot operator CRX The controlled-RX operator CRY The controlled-RY operator CRZ The controlled-RZ operator Hadamard The Hadamard operator PauliX The Pauli X operator PauliY The Pauli Y operator PauliZ The Pauli Z operator PhaseShift Arbitrary single qubit local phase shift ControlledPhaseShift A qubit controlled phase shift. QubitStateVector Prepare subsystems using the given ket vector in the computational basis. Rot Arbitrary single qubit rotation RX The single qubit X rotation RY The single qubit Y rotation RZ The single qubit Z rotation S The single-qubit phase gate T The single-qubit T gate

Supported observables:

 Hadamard The Hadamard operator Identity The identity observable $$\I$$. PauliX The Pauli X operator PauliY The Pauli Y operator PauliZ The Pauli Z operator Hamiltonian Operator representing a Hamiltonian.

The lightning.gpu device directly supports the adjoint differentiation method, and enables parallelization over the requested observables. This supports direct controlling of observable batching, which can be used to run concurrent calculations across multiple available GPUs.

If you are computing a large number of expectation values, or if you are using a large number of wires on your device, it may be best to evenly divide the number of expectation value calculations across all available GPUs. This will reduce the overall memory cost of the obseravbles per GPU, at the cost of additional compute time. Assuming m observables, and n GPUs, the default behaviour is to pre-allocate all storage for n observables on a single GPU. To divide the workload amongst many GPUs, initialize a lightning.gpu device with the batch_obs=True keyword argument, as:

import pennylane as qml
dev = qml.device("lightning.gpu", wires=20, batch_obs=True)


With the above, each GPU will see at most m/n observables to process, reducing the preallocated memory footprint.

Additionally, there can be situations where even with the above distribution, and limited GPU memory, the overall problem does not fit on the requested GPU devices. You can further reduce the concurrent allocations on available GPUs by providing an integer value to the batch_obs keyword. For example, to batch evaluate observables with at most 1 observable allocation per GPU, define the device as:

import pennylane as qml
dev = qml.device("lightning.gpu", wires=27, batch_obs=1)


Each problem is unique, so it can often be best to choose the default behaviour up-front, and tune with the above only if necessary.