Argonne Notes on the Future Architecture of the Aurora Exascale System

There are two supercomputers called “Aurora” affiliated with Argonne National Laboratory: the one that was supposed to be built this year, and the one that was briefly known as “A21” last year, which will be built in 2021 and will be the first exascale system deployed in the United States.

Details on the second, and now only, major Aurora system have just emerged as Argonne put out the call for proposals for the early science program that will let researchers put code on the supercomputer for three months before it begins its production work. The call gives a glimpse of what the second Aurora machine might look like, and it shares some similarities with the Aurora machine that Intel and Cray were expected to deliver this year.

The new information comes in the wake of Oak Ridge National Laboratory and Lawrence Livermore National Laboratory providing timelines for their respective exascale supercomputers, Frontier and El Capitan. It looks like Frontier is coming in early 2022, with El Capitan arriving later that year. Their architectures have not yet been announced, but the expectation is that both Oak Ridge and Lawrence Livermore will stick with hybrid CPU-GPU systems, pairing IBM’s Power10 processors with an as-yet-unannounced future Nvidia GPU accelerator. But no one has made promises about that.

The original Aurora machine was to be built by Intel and Cray, using the now-discontinued “Knights Hill” Xeon Phi many-core processor and the still-unreleased 200 Gb/s Omni-Path 200 interconnect, both from Intel. The system was to be based on Cray’s future “Shasta” system design, and the Omni-Path interconnect was expected to be augmented with a healthy dose of the technology underlying Cray’s “Aries” dragonfly interconnect. (Intel bought Cray’s interconnect business back in 2012 for precisely that purpose.) The original Aurora design was to have 7 PB of on-package high-bandwidth memory with an aggregate bandwidth of more than 30 PB/s across the 50,000-plus nodes in the system. The Omni-Path 200 links were to sit on the Xeon Phi package and were expected to deliver 2.5 PB/s of aggregate node link bandwidth and more than 500 TB/s of bisection bandwidth across the interconnect.

Fast forward to the new and improved Aurora, and we now know that it will also have around 50,000 nodes and 5 PB of memory of various kinds when installation begins in late 2020 and early 2021, with acceptance scheduled for early 2022. (Those feeds and speeds are the new bit.) The nodes will be equipped with high-bandwidth memory, and we suspect Intel will be using the same HBM memory used on GPU accelerators, not the MCDRAM memory it created with Micron Technology and used in the “Knights Landing” Xeon Phi 7000 series accelerators and standalone processors.

As you can see, the number of nodes is roughly the same between the two different versions of the Aurora machine, but the compute capacity, which we presume is expressed in double-precision floating point math as is tradition in the HPC world, is up by a factor of at least 5.5X, from the 180 petaflops expected with the original Aurora (let’s call it A18 so we have a name for it) to the 1,000-plus petaflops of the A21 system that Intel and Cray are building for Argonne. The memory capacity of the A21 machine is actually about 30 percent lower, which seems like an odd choice unless the memory is unified and made more coherent across the serial and parallel compute elements. That kind of unification is exactly what we expect on the compute front, and Lawrence Berkeley National Laboratory has put some thought into the future of accelerated computing that leads us to suspect the future NERSC-10 exascale machine will be based on the same kind of compute, if possibly a successor generation, given its later timing.
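
To ground those two figures, here is the quick arithmetic, using only the numbers quoted above for the A18 design (180 petaflops, 7 PB) and the A21 design (1,000-plus petaflops, 5 PB):

    1,000 petaflops / 180 petaflops ≈ 5.6X        (hence "at least 5.5X" more compute)
    (7 PB - 5 PB) / 7 PB ≈ 29 percent             (hence roughly 30 percent less memory)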

Here, in its own words, is what Argonne shared with the researchers applying to run jobs on the future A21 machine, which provides a glimpse into the architecture:

  • Nodes have both high single-threaded core performance and, at the same time, the ability to achieve exceptional performance on code with only modest amounts of parallelism.
  • The architecture is optimized to support codes with sections of fine-grained parallelism (e.g., ~100 lines of code in a FOR loop) separated by serial code sections. The degree of fine-grained parallelism (e.g., the number of iterations of the loop) required to fully exploit the performance capabilities is moderate, in the ~1,000 range for most applications. (See the code sketch after this list.)
  • Independence of these loop iterations is ideal but not required for correctness, although dependencies that limit the amount of work that can be done in parallel are likely to affect performance.
  • The number of such loops is not limited, and the overhead for starting and ending loops is very low.
  • Serial code (within an MPI rank) runs very efficiently, and the performance ratio between the serial and parallel capabilities is a moderate factor of about 10X, so code that has not been completely restructured will still work well.
  • OpenMP 5 will likely contain the constructs necessary to guide the compiler toward optimal performance.
  • The compute capability of the nodes will grow similarly to the memory bandwidth, so the ratio of memory bandwidth to compute will not be significantly different from that of systems from a few years ago. A little better than recent systems, actually.
  • Memory capacity will not grow as fast as compute capability, so a key strategy for exploiting the future architectures will be getting more performance through parallelism at the same capacity. While this capacity does not grow much compared to current machines, it has the property that the memory is all high performance, which removes some of the concerns about managing multiple memory tiers and explicit data movement.
  • The memory in a node will be coherent, and all compute elements will be first-class citizens with equal access to all resources (memory, fabric, and so on).
  • Fabric BW will increase similarly to computing power for local communication patterns, although global communication bandwidth is unlikely to increase as quickly as computing power.
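
To make that guidance concrete, here is a minimal sketch of the pattern those bullets describe: serial sections within an MPI rank separated by fine-grained loops of around 1,000 independent iterations that can be spread across a node's parallel compute elements. This is our own illustration, not Argonne code; the array sizes and arithmetic are placeholders, and plain OpenMP is used as a stand-in for whatever OpenMP 5 constructs ultimately apply on A21.

    /* A minimal sketch, not Argonne's actual code: serial work followed by a
       fine-grained parallel loop of ~1,000 iterations, the pattern described
       in the A21 guidance above. Array sizes and formulas are placeholders.
       Compile with an OpenMP-capable compiler, e.g.: cc -fopenmp sketch.c */
    #include <stdio.h>

    #define N 1000   /* roughly 1,000 independent iterations, per the guidance */

    int main(void)
    {
        double field[N], flux[N];

        /* Serial section (within an MPI rank): setup and bookkeeping that is
           not worth parallelizing and should still run efficiently. */
        for (int i = 0; i < N; i++) {
            field[i] = (double)i;
            flux[i]  = 0.0;
        }

        /* Fine-grained parallel section: independent iterations the runtime
           can spread across the node's parallel compute elements. OpenMP 5
           adds richer constructs (loop, target offload), but a plain
           parallel-for conveys the shape of the code. */
        #pragma omp parallel for
        for (int i = 0; i < N; i++) {
            flux[i] = 0.5 * field[i] * field[i];   /* placeholder math */
        }

        /* Another serial section: reduce and report. */
        double total = 0.0;
        for (int i = 0; i < N; i++) {
            total += flux[i];
        }
        printf("total flux: %f\n", total);

        return 0;
    }

The point of the bullets is that loops of this modest size, with low start and stop overhead and no hard requirement for iteration independence, should be enough to reach the machine's parallel performance without a wholesale redesign of the code.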

Like everyone else, we have been trying to figure out what Intel would propose on the compute front, and Lawrence Berkeley’s NERSC has provided some insights that we think are relevant to Argonne as well. This architectural block diagram from a March 2016 update on the Edison NERSC-7, Cori NERSC-8, and future NERSC-9 and NERSC-10 machines tells the story we believe will play out:

One such chip proposed by Intel would essentially be a massively parallel Knights-family processor mixed with some Xeon cores on the die to boost serial performance for selected sections of code. It would, in effect, be an on-die mashup of the kind of hybrid Xeon/Xeon Phi systems that many of the largest HPC centers in the world have installed – most Xeon Phi systems have a hefty Xeon component for backward compatibility with code that is more computationally intensive than it is easy to manage.

This design should come as no surprise at all, and it is a wonder why Intel is so reluctant to talk about its plans for the Aurora A21 implementation. The secrecy does not help, especially when everyone knows it had to rewrite the Aurora contract because whatever it was doing with the now-cancelled Knights Hill processor for the original A18 machine was no longer going to do the job the market needed. We think you can blame machine learning for that. These massive machines need to be good at traditional HPC simulation and modeling as well as machine learning and various kinds of data analytics.

This processor design is well known in HPC systems and is in fact similar to IBM’s Power-based “Cell” processor used in “Roadrunner,” the first petascale system, which was installed at Los Alamos National Laboratory between 2006 and 2008 and decommissioned in 2013. The original PowerXCell 8i processor used in Roadrunner had a single Power core and eight “synergistic co-processors,” which were parallel processing units suitable for either math or graphics processing. While Roadrunner hung these Cell chips off pairs of AMD Opteron processors across its four-blade nodes, the Cell chips were designed to be a primary, standalone compute element and were often used as such in other devices. IBM had four generations of Cell chips on its roadmap, culminating in a variant with two fat processor cores and 32 of the co-processors on a die that never saw the light of day – and which, frankly, opened the door for Nvidia to push its GPU offload model for hybrid CPU-GPU systems.

This fat-core/skinny-core hybrid approach is also used in the SW26010 processor at the heart of the Sunway TaihuLight supercomputer, which was developed by the National Research Center of Parallel Computer Engineering and Technology (NRCPC) and is in full production on early workloads at the National Supercomputing Center in Wuxi, China. In this design, there are four core groups, each with one fat management core and a mesh of 64 skinny compute cores, for a total of 4 × (1 + 64) = 260 cores.

It is not hard to guess that the Aurora A21 machine will mix fat Xeon cores and skinny Atom cores with incredibly juicy AVX math units. In all fairness, there is no better choice for Intel, as there is no single CPU core that can do everything.
