Porting a particle-in-cell code to exascale architectures for Aurora

Currently running on the vector CPU-based Theta machine at the Argonne Leadership Computing Facility (ALCF) as well as on the GPU-accelerated Summit system at the Oak Ridge Leadership Computing Facility, XGC is capable of resolving boundary plasma problems across the magnetic separatrix (i.e., the boundary between the magnetically confined and unconfined plasmas) using first-principles-based kinetic equations.

In preparation for the next generation of high-performance computing – exemplified by the upcoming Polaris and Aurora systems at the ALCF – the code is being re-implemented for exascale using a performance-portable approach. Running at exascale offers unique computational capabilities, some of which have potentially transformational implications for fusion science: for example, the extension to exascale will allow the study of a larger and more realistic range of dimensionless plasma parameters than has ever been achieved before, along with the full spectrum of kinetic microinstabilities that control the quality of energy confinement in a toroidal plasma. In addition, exascale will enable physics modeling that includes multiple-charge tungsten ion species – impurities released from the tokamak vessel walls that migrate across the magnetic separatrix and affect edge plasma behavior and fusion performance in the core plasma.

Portability practices

Preparation for the Aurora exascale machine in a manner portable to other architectures relies on non-machine-specific libraries and high-level programming models – Kokkos and Cabana. While the former predates the Exascale Computing Project, both Kokkos and Cabana target first-generation exascale computing platforms.

As a best practice for code development, the XGC team capitalizes on these efforts by employing high-level interfaces and libraries. In this way, it benefits directly from the work of the library and programming-model developers.

Additionally, with no code changes, the team will be able to take advantage of Kokkos' upcoming SYCL/DPC++ backend, which is expected to perform well at the time of publication and to be largely portable across architectures. In the meantime, the team is working with Kokkos' early OpenMP target implementation.


The team’s application can thus run from the start on any platform that supports the underlying software. These factors led the team to move XGC from vendor-specific programming approaches (such as OpenACC and CUDA Fortran) to Kokkos and Cabana for GPU acceleration.

Once the change was effected and the relevant programming layers were integrated into the XGC code, the team achieved comparable or improved performance relative to the vendor-specific implementations.

XGC contains two computationally intensive kernels: one for the kinetic electron push and one for nonlinear Fokker-Planck collisions. When no GPUs are used, these kernels account for more than 95 percent of the processing time in a production run.

The electron push kernel was the first application component to be ported. Its previous implementation used OpenMP threads and vectorization techniques for CPU architectures such as Intel Knights Landing (KNL), and CUDA Fortran for NVIDIA GPUs.

It was re-implemented using the Cabana library, a layer on top of Kokkos for implementing particle codes. Following this re-implementation, it was determined that the computation, with minimal additional effort on the part of the XGC team, matched or outperformed the previous kernel implementation on Summit as well as on Theta.

The collision kernel likewise showed comparable or improved performance relative to its OpenACC implementation after being ported with Kokkos.

Optimal performance across the various architectures has been achieved through the flexible data structures for storing particle data that Cabana provides, in combination with the underlying Kokkos implementation on which it relies.


