Lightning GPU device

The lightning.gpu device is an extension of PennyLane’s built-in lightning.qubit device. It extends the CPU-focused Lightning simulator to run using the NVIDIA cuQuantum SDK, enabling GPU-accelerated simulation of quantum state-vector evolution.

A lightning.gpu device can be loaded using:

import pennylane as qml
dev = qml.device("lightning.gpu", wires=2)

If the NVIDIA cuQuantum libraries are available, the above device will allow all operations to be performed on a CUDA-capable GPU of generation SM 7.0 (Volta) or newer. If the libraries are not correctly installed, or cannot be found on the path, the device will raise an error.

The lightning.gpu device supports quantum circuit gradients using the adjoint differentiation method. By default, this method is enabled. It can also be explicitly specified using the diff_method argument when creating a QNode:

@qml.qnode(dev, diff_method="adjoint")
def circuit(params):
    ...

Check out the Lightning-GPU installation guide for more information.

Supported operations and observables

Supported operations:

`BasisState`	Prepares a single computational basis state.
`BlockEncode`	Construct a unitary $U(A)$ such that an arbitrary matrix $A$ is encoded in the top-left block.
`CNOT`	The controlled-NOT operator
`ControlledPhaseShift`	A qubit controlled phase shift.
`ControlledQubitUnitary`	Apply an arbitrary fixed unitary matrix `U` to `wires`.
`CRot`	The controlled-Rot operator
`CRX`	The controlled-RX operator
`CRY`	The controlled-RY operator
`CRZ`	The controlled-RZ operator
`CSWAP`	The controlled-swap operator
`CY`	The controlled-Y operator
`CZ`	The controlled-Z operator
`DiagonalQubitUnitary`	Apply an arbitrary diagonal unitary matrix with a dimension that is a power of two.
`DoubleExcitation`	Double excitation rotation.
`DoubleExcitationMinus`	Double excitation rotation with negative phase-shift outside the rotation subspace.
`DoubleExcitationPlus`	Double excitation rotation with positive phase-shift outside the rotation subspace.
`ECR`	An echoed RZX($\pi/2$) gate.
`GlobalPhase`	A global phase operation that multiplies all components of the state by $e^{-i \phi}$.
`Hadamard`	The Hadamard operator
`Identity`	The Identity operator
`IsingXX`	Ising XX coupling gate
`IsingXY`	Ising (XX + YY) coupling gate
`IsingYY`	Ising YY coupling gate
`IsingZZ`	Ising ZZ coupling gate
`ISWAP`	The i-swap operator
`MultiControlledX`	Apply a `PauliX` gate controlled on an arbitrary computational basis state.
`MultiRZ`	Arbitrary multi Z rotation.
`OrbitalRotation`	Spin-adapted spatial orbital rotation.
`PauliX`	The Pauli X operator
`PauliY`	The Pauli Y operator
`PauliZ`	The Pauli Z operator
`PCPhase`	A projector-controlled phase gate.
`PhaseShift`	Arbitrary single qubit local phase shift
`PSWAP`	Phase SWAP gate
`QubitCarry`	Apply the `QubitCarry` operation to four input wires.
`QubitSum`	Apply a `QubitSum` operation on three input wires.
`QubitUnitary`	Apply an arbitrary unitary matrix with a dimension that is a power of two.
`Rot`	Arbitrary single qubit rotation
`RX`	The single qubit X rotation
`RY`	The single qubit Y rotation
`RZ`	The single qubit Z rotation
`S`	The single-qubit phase gate
`SingleExcitation`	Single excitation rotation.
`SingleExcitationMinus`	Single excitation rotation with negative phase-shift outside the rotation subspace.
`SingleExcitationPlus`	Single excitation rotation with positive phase-shift outside the rotation subspace.
`SISWAP`	The square root of i-swap operator.
`SQISW`	alias of `SISWAP`
`SWAP`	The swap operator
`SX`	The single-qubit Square-Root X operator.
`T`	The single-qubit T gate
`Toffoli`	Toffoli (controlled-controlled-X) gate.

Supported observables:

`Identity`	The Identity operator
`Hadamard`	The Hadamard operator
`PauliX`	The Pauli X operator
`PauliY`	The Pauli Y operator
`PauliZ`	The Pauli Z operator
`Projector`	Observable corresponding to the state projector $P=\ket{\phi}\bra{\phi}$.
`Hermitian`	An arbitrary Hermitian observable.
`Hamiltonian`	alias of `LinearCombination`
`SparseHamiltonian`	A Hamiltonian represented directly as a sparse matrix in Compressed Sparse Row (CSR) format.
`Exp`	A symbolic operator representing the exponential of an operator.
`Prod`	Symbolic operator representing the product of operators.
`SProd`	Arithmetic operator representing the scalar product of an operator with the given scalar.
`Sum`	Symbolic operator representing the sum of operators.

Parallel adjoint differentiation support:

The lightning.gpu device directly supports the adjoint differentiation method and can parallelize the computation over the requested observables. Observable batching can be controlled directly, which makes it possible to run concurrent calculations across multiple available GPUs.

If you are computing a large number of expectation values, or if you are using a large number of wires on your device, it may be best to evenly divide the number of expectation value calculations across all available GPUs. This will reduce the overall memory cost of the observables per GPU, at the cost of additional compute time. Assuming m observables and n GPUs, the default behaviour is to pre-allocate all storage for the m observables on a single GPU. To divide the workload amongst many GPUs, initialize a lightning.gpu device with the batch_obs=True keyword argument:

import pennylane as qml
dev = qml.device("lightning.gpu", wires=20, batch_obs=True)

With the above, each GPU will see at most m/n observables to process, reducing the preallocated memory footprint.
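
To make the m/n bound concrete, here is a small standalone sketch (plain Python, no GPU required) of how m observables could be split into per-GPU chunks. The helper `split_observables` is hypothetical and only illustrates the arithmetic; it is not how lightning.gpu schedules work internally.

```python
from math import ceil


def split_observables(num_obs, num_gpus):
    """Split observable indices into at most `num_gpus` near-equal chunks.

    Hypothetical helper for illustration only. Each chunk holds at most
    ceil(num_obs / num_gpus) observable indices.
    """
    if num_obs < 0 or num_gpus < 1:
        raise ValueError("need num_obs >= 0 and num_gpus >= 1")
    if num_obs == 0:
        return []
    chunk = ceil(num_obs / num_gpus)
    indices = list(range(num_obs))
    return [indices[i:i + chunk] for i in range(0, num_obs, chunk)]


# Example: m = 10 observables on n = 4 GPUs.
chunks = split_observables(num_obs=10, num_gpus=4)
for gpu_id, obs in enumerate(chunks):
    print(f"GPU {gpu_id}: observables {obs}")
```

With 10 observables and 4 GPUs, no chunk holds more than ceil(10/4) = 3 observables, which is the "at most m/n" bound rounded up to a whole number.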

Additionally, there can be situations where even with the above distribution, and limited GPU memory, the overall problem does not fit on the requested GPU devices. You can further reduce the concurrent allocations on available GPUs by providing an integer value to the batch_obs keyword. For example, to batch evaluate observables with at most 1 observable allocation per GPU, define the device as:

import pennylane as qml
dev = qml.device("lightning.gpu", wires=27, batch_obs=1)

Each problem is unique, so it can often be best to choose the default behaviour up-front, and tune with the above only if necessary.
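
As a rough guide when choosing an integer batch_obs value, the sketch below estimates the per-GPU memory footprint. It rests on two assumptions that are not stated in the text above: amplitudes are stored as complex128 (16 bytes each), and every concurrently allocated observable needs one state-vector-sized buffer in addition to the state itself. The function names are hypothetical, and workspace or library overheads are ignored.

```python
def state_vector_bytes(num_wires, bytes_per_amplitude=16):
    """Bytes needed for one state vector (assumes complex128 by default)."""
    return (2 ** num_wires) * bytes_per_amplitude


def adjoint_footprint_gib(num_wires, concurrent_obs):
    """Estimated GiB per GPU: one state plus one buffer per concurrent observable.

    Assumption for illustration: each concurrently held observable costs
    one extra state-vector-sized allocation.
    """
    total = state_vector_bytes(num_wires) * (1 + concurrent_obs)
    return total / 2 ** 30


# Estimated footprint for a 27-wire device at several batch sizes.
for batch in (1, 4, 16):
    print(f"batch_obs={batch}: ~{adjoint_footprint_gib(27, batch):.0f} GiB")
```

Under these assumptions a 27-wire state vector takes 2 GiB, so batch_obs=1 needs roughly 4 GiB per GPU while 16 concurrent observables would need roughly 34 GiB.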

Multi-GPU/multi-node support:

The lightning.gpu device allows users to leverage the computational power of many GPUs distributed across multiple nodes for running large-scale simulations. Provided that NVIDIA cuQuantum libraries, a CUDA-aware MPI library and mpi4py are properly installed and the path to the libmpi.so is added to the LD_LIBRARY_PATH environment variable, the following requirements should be met to enable multi-node and multi-GPU simulations:

The mpi keyword argument should be set as True when initializing a lightning.gpu device.
Both the total number of MPI processes and the number of MPI processes per node must be powers of 2 (for example, 2, 4, 8 or 16). Each MPI process is responsible for managing one GPU.
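
The power-of-2 requirement can be checked before a job is submitted. The following is a minimal sketch in plain Python; `check_mpi_layout` is a hypothetical helper, not part of PennyLane or mpi4py. Whether a single process counts as valid is not stated above; this sketch accepts it, since 1 is 2 to the power 0.

```python
def is_power_of_two(n):
    """True for 1, 2, 4, 8, ... (uses the n & (n - 1) bit trick)."""
    return n >= 1 and (n & (n - 1)) == 0


def check_mpi_layout(total_procs, procs_per_node):
    """Return a list of problems with the requested MPI layout (empty if fine)."""
    problems = []
    if not is_power_of_two(total_procs):
        problems.append(f"total MPI processes ({total_procs}) is not a power of 2")
    if not is_power_of_two(procs_per_node):
        problems.append(f"MPI processes per node ({procs_per_node}) is not a power of 2")
    if procs_per_node > total_procs:
        problems.append("more processes per node than processes in total")
    return problems


print(check_mpi_layout(total_procs=8, procs_per_node=4))  # no problems reported
print(check_mpi_layout(total_procs=6, procs_per_node=2))  # total is not a power of 2
```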

The workflow for the multi-node/GPUs feature is as follows:

from mpi4py import MPI
import pennylane as qml
dev = qml.device('lightning.gpu', wires=8, mpi=True)
@qml.qnode(dev)
def circuit_mpi():
    qml.PauliX(wires=[0])
    return qml.state()
local_state_vector = circuit_mpi()

Currently, a lightning.gpu device with the MPI multi-GPU backend supports all the gate operations and observables that a lightning.gpu device with a single GPU/node backend supports.

By default, each MPI process returns the overall simulation results, except for qml.state() and qml.probs(): for these two measurements, each MPI process returns only its local simulation results, to avoid buffer overflow. It is the user’s responsibility to ensure correct data collection in those two cases. Here are examples of collecting the local simulation results for qml.state() and qml.probs():

The workflow for collecting the local state vectors (returned by qml.state()) on rank 0 is as follows:

from mpi4py import MPI
import pennylane as qml
comm = MPI.COMM_WORLD
rank = comm.Get_rank()
dev = qml.device('lightning.gpu', wires=8, mpi=True)
@qml.qnode(dev)
def circuit_mpi():
    qml.PauliX(wires=[0])
    return qml.state()
local_state_vector = circuit_mpi()
# Rank 0 collects the local state vectors from all ranks
state_vector = comm.gather(local_state_vector, root=0)
if rank == 0:
    print(state_vector)
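
On rank 0, comm.gather returns a list with one entry per rank rather than a single array. The sketch below simulates that situation in plain Python (no MPI or GPU needed) and shows how the pieces could be joined. It assumes that rank r holds the r-th contiguous slice of the full state vector; that layout is an assumption of this sketch and is not stated above.

```python
from itertools import chain

n_wires = 8
n_ranks = 4
local_size = 2 ** n_wires // n_ranks  # amplitudes held by each rank

# Full state expected after PauliX on wire 0: |10000000>.
# In PennyLane's ordering wire 0 is the most significant bit.
full = [0j] * 2 ** n_wires
full[2 ** (n_wires - 1)] = 1 + 0j

# Simulate what comm.gather(local_state_vector, root=0) would hand to rank 0,
# ASSUMING rank r holds the r-th contiguous slice of the state.
gathered = [full[r * local_size:(r + 1) * local_size] for r in range(n_ranks)]

# Join the per-rank pieces back into one flat state vector.
state_vector = list(chain.from_iterable(gathered))

nonzero = [i for i, amp in enumerate(state_vector) if amp != 0]
owner_rank = nonzero[0] // local_size
print(len(state_vector), nonzero, owner_rank)
```

With real NumPy arrays, the same joining step on rank 0 could be done with np.concatenate on the gathered list.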

The workflow for collecting the local probabilities (returned by qml.probs()) on rank 0 is as follows:

from mpi4py import MPI
import pennylane as qml
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
dev = qml.device('lightning.gpu', wires=8, mpi=True)
prob_wires = [0, 1]

@qml.qnode(dev)
def mpi_circuit():
    qml.Hadamard(wires=1)
    return qml.probs(wires=prob_wires)

local_probs = mpi_circuit()

#For data collection across MPI processes.
recv_counts = comm.gather(len(local_probs),root=0)
if rank == 0:
    probs = np.zeros(2**len(prob_wires))
else:
    probs = None

comm.Gatherv(local_probs,[probs,recv_counts],root=0)
if rank == 0:
    print(probs)

The Python script can then be executed with the following command:

$ mpirun -np 4 python yourscript.py

Furthermore, users can optimize the performance of their applications by allocating the appropriate amount of GPU memory for MPI operations with the mpi_buf_size keyword argument. To allocate n mebibytes (MiB, 2^20 bytes) of GPU memory for MPI operations, initialize a lightning.gpu device with the mpi_buf_size=n keyword argument, as follows:

from mpi4py import MPI
import pennylane as qml
n = 8
dev = qml.device("lightning.gpu", wires=20, mpi=True, mpi_buf_size=n)

Note the value of mpi_buf_size should also be a power of 2. Remember to carefully manage the mpi_buf_size parameter, taking into account the available GPU memory and the memory requirements of the local state vector, to prevent memory overflow issues and ensure optimal performance. By default (mpi_buf_size=0), the GPU memory allocated for MPI operations will match the size of the local state vector, with a limit of 64 MiB. Please be aware that a runtime warning will occur if the local GPU memory buffer for MPI operations exceeds the GPU memory allocated to the local state vector.
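
The rules in the paragraph above can be turned into a quick pre-flight check. This is a sketch of the rules as described here, not the device's actual implementation: the function names are hypothetical, and the local state vector size assumes complex128 amplitudes (16 bytes each) split evenly across MPI processes.

```python
import warnings

MIB = 2 ** 20
DEFAULT_CAP_MIB = 64  # default limit quoted in the text above


def local_state_vector_mib(num_wires, num_procs, bytes_per_amplitude=16):
    """MiB of state vector held by each MPI process (assumes an even split)."""
    return (2 ** num_wires) * bytes_per_amplitude / num_procs / MIB


def effective_mpi_buf_mib(num_wires, num_procs, mpi_buf_size=0):
    """Buffer size in MiB implied by the documented rules (illustrative only)."""
    local_mib = local_state_vector_mib(num_wires, num_procs)
    if mpi_buf_size == 0:
        # Default: match the local state vector, limited to 64 MiB.
        return min(local_mib, DEFAULT_CAP_MIB)
    if mpi_buf_size < 0 or mpi_buf_size & (mpi_buf_size - 1):
        raise ValueError("mpi_buf_size must be a power of 2")
    if mpi_buf_size > local_mib:
        warnings.warn("MPI buffer exceeds the local state vector size")
    return mpi_buf_size


print(effective_mpi_buf_mib(20, 4))  # local state vector is 4 MiB
print(effective_mpi_buf_mib(30, 4))  # local state vector is 4096 MiB, capped
```

Under these assumptions, a 20-wire simulation on 4 processes holds 4 MiB per process, so a request of 8 MiB would trigger the warning branch.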

Multi-GPU/multi-node support for adjoint method:

The lightning.gpu device with the multi-GPU/multi-node backend also directly supports the adjoint differentiation method. Instead of batching observables across the multiple GPUs available within a node, the state vector is distributed among the available GPUs with the multi-GPU/multi-node backend. By default, the adjoint method with MPI support follows the performance-oriented implementation of the single GPU backend. This means that a separate bra is created for each observable and the ket is updated only once for each operation, regardless of the number of observables.

The workflow for the default adjoint method with MPI support is as follows:

from mpi4py import MPI
import pennylane as qml
from pennylane import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
n_wires = 20
n_layers = 2

dev = qml.device('lightning.gpu', wires=n_wires, mpi=True)
@qml.qnode(dev, diff_method="adjoint")
def circuit_adj(weights):
    qml.StronglyEntanglingLayers(weights, wires=list(range(n_wires)))
    return qml.math.hstack([qml.expval(qml.PauliZ(i)) for i in range(n_wires)])

if rank == 0:
    params = np.random.random(qml.StronglyEntanglingLayers.shape(n_layers=n_layers, n_wires=n_wires))
else:
    params = None

params = comm.bcast(params, root=0)
jac = qml.jacobian(circuit_adj)(params)

If users aim to handle larger system sizes with limited hardware resources, the memory-optimized adjoint method with MPI support is more appropriate. The memory-optimized adjoint method with MPI support employs a single bra object that is reused for all observables. This approach results in a notable reduction in the required GPU memory when dealing with a large number of observables. However, it’s important to note that the reduction in memory requirement may come at the expense of slower execution due to the multiple ket updates per gate operation.
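
The trade-off can be illustrated with a back-of-the-envelope estimate. The sketch below counts one ket plus one bra per observable for the default method, and one ket plus a single reused bra for the memory-optimized method. It assumes complex128 amplitudes (16 bytes each) and an even split of each state vector across MPI processes, and it ignores MPI buffers and library workspace. The function name is hypothetical.

```python
def adjoint_memory_mib(num_wires, num_obs, num_procs,
                       memory_optimized=False, bytes_per_amplitude=16):
    """Estimated MiB per GPU for the ket and bra state vectors only."""
    local_mib = (2 ** num_wires) * bytes_per_amplitude / num_procs / 2 ** 20
    # Default: one bra per observable. Memory-optimized: one bra, reused.
    num_bras = 1 if memory_optimized else num_obs
    return local_mib * (1 + num_bras)


default_mib = adjoint_memory_mib(20, num_obs=20, num_procs=4)
optimized_mib = adjoint_memory_mib(20, num_obs=20, num_procs=4,
                                   memory_optimized=True)
print(f"default: {default_mib:.0f} MiB, memory-optimized: {optimized_mib:.0f} MiB")
```

For 20 wires, 20 observables and 4 processes this gives roughly 84 MiB versus 8 MiB per GPU. The gap grows linearly with the number of observables.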

To enable the memory-optimized adjoint method with MPI support, set batch_obs=True when creating the device:

dev = qml.device('lightning.gpu', wires=n_wires, mpi=True, batch_obs=True)

For the adjoint method, each MPI process will provide the overall simulation results.

Note

The Projector observable is not supported by the multi-GPU backend.
