Kernel performance tuning

OpenMP threaded kernels

OpenMP acceleration of gate kernels across all kernel types (LM, AVX2, and AVX512) is enabled by default on Linux and MacOS wheels in Lightning-Qubit.

On other operating systems, OpenMP support can be enabled by setting the environment variable LQ_ENABLE_KERNEL_OMP=ON before starting your Python session, or if already running, before simulating your PennyLane programs. You can also control the number of threads used by setting the OMP_NUM_THREADS environment variable.

For workloads that involve gradient computations with many observable measurements, OpenMP acceleration may reduce performance due to oversubscription of threads to CPU cores. To mitigate this, use the CMake flag -DLQ_ENABLE_KERNEL_OMP=OFF when building Lightning-Qubit.

For workloads that show benefit from the use of threaded gate kernels, sometimes updating the CPU cache to accommodate recently modified data can become a bottleneck, and saturates the performance gained at high thread counts. This may be alleviated somewhat on systems supporting AVX2 and AVX-512 operations using the -DLQ_ENABLE_KERNEL_AVX_STREAMING=on CMake flag. This forces the data to avoid updating the CPU cache and can improve performance for larger workloads.