Exascale Exasperation: Why Did DOE Give Intel a Second Chance? Can Nvidia GPUs Ride to Aurora’s Rescue?

The most debated topic in HPC right now – another Intel chip delay, and with it a delay of the US flagship Aurora exascale system – is something nobody wants to talk about directly. Not Argonne National Laboratory, where Intel was supposed to install Aurora in 2021; not the Department of Energy’s Exascale Computing Project, which is leading the development of a “capable exascale ecosystem”; and not DOE itself. As for Intel, a spokesman earlier this week promised to get back to us shortly, but had not done so at press time.

In place of information (aside from rote answers in a public relations Q&A released three weeks ago by Intel and Argonne), the HPC community is left to speculate about how those responsible for Aurora got into this fix, what it means for US exascale strategy, and what DOE and Intel can do about it.

By way of background, Intel announced on July 23 that its 7nm Ponte Vecchio GPU, which is to be integrated with Intel Xeon CPUs in Aurora, will be delayed by six months. The general reaction could be described as shocked but not surprised.

Shocked because of Aurora’s high strategic importance to the US exascale effort, a multi-year, multi-billion-dollar project seen as vital to US technological competitiveness against its major geopolitical rival, China, which plans, along with the EU and Japan, to build an exascale system next year.

Not surprised because of Intel’s missteps at 10nm and 7nm and its failure two years ago to deliver the pre-exascale Aurora (A18) system to Argonne (for more background, see “Another Intel 7nm Chip Delay – What Does It Mean for Aurora Exascale?”).

“For the past five years, Intel has plagued HPC users with delays and cancellations,” said industry analyst Addison Snell, CEO of Intersect360 Research. “Aurora was originally intended to be a pre-exascale system in 2018, but it had to be redefined after Intel canceled the Xeon Phi processor and the OmniPath interconnect. Intel needed to deliver Ponte Vecchio and Aurora on time and to specification (using revised definitions of both terms) to save face. If Argonne has to wait until 2022 for delivery, this exascale supercomputer will become an embarrassing afterthought, overshadowed by the Frontier and El Capitan systems elsewhere within the DOE, let alone what’s being done outside of the US.”

Swap Nvidia for Intel GPUs?

Snell suggested that in the absence of Intel’s 7nm Ponte Vecchio GPU, Nvidia GPUs should be paired with Intel CPUs in Aurora.

“At this point in time, the best solution for Argonne, for US exascale efforts, and for the US taxpayer could be for Intel to eat crow and open the door to Nvidia to provide Aurora’s GPU components,” said Snell. “Nvidia is by miles and miles the leading GPU provider. Nvidia has proven successful with the pre-exascale CORAL systems and would give the DOE the ability to keep its optimizations in CUDA without relying on alternative GPUs from Intel or AMD.”

“Intel made a big bet on becoming the primary vendor for this contract,” added Snell, “and that comes with the responsibility to deliver. The DOE should insist on delivery in mid-2021, even if that means Intel will have to use someone else’s GPUs to meet the terms of the contract. Whatever Argonne and the DOE decide, it should be clear that Intel has exhausted its second chances with Aurora.”

Snell argued that there is no compelling technical reason why both CPUs and GPUs need to be Intel parts.

“AMD offers coherency advantages in combining AMD CPUs and GPUs over a common Infinity Fabric. However, we haven’t heard the same from Intel about its own processors and the proposed Ponte Vecchio GPU,” he said. “If Intel takes oneAPI (its cross-architecture programming model) seriously, it shouldn’t matter which GPU it is from a programming perspective.”

Karl Freund, senior analyst for machine learning and HPC at Moor Insights & Strategy, however, doubted that swapping Nvidia for Intel GPUs could work.

“Here’s the problem. The common thread in all three US DOE exascale deployments is that they are tightly integrated CPU-GPU complexes,” he said. “So the GPU isn’t on a PCIe card … It’s not like an APU that you find in a laptop, where the CPU and GPU are actually on the same package, but it uses the same concept: the native SMP fabric of the CPU speaks directly to the GPU. This applies to both Ponte Vecchio-Xeon and AMD’s next-generation Radeon (GPU)-Epyc (CPU).”

Replacing Intel’s GPUs with Nvidia’s would mean higher latencies and slower performance, according to Freund, since a Xeon can talk to an Nvidia GPU only over PCIe.

The upshot: DOE and Argonne have no choice but to stick with their prime contractor, Intel.

“They just have to stay the course and accept that they (Argonne) will not be the first exascale,” Freund said. “And the bigger exascales (Frontier at Oak Ridge National Lab and El Capitan at Lawrence Livermore National Lab) will beat them with AMD technology. So they (Argonne) could say, ‘Well, okay, let’s step back and reshape Aurora to be bigger.’ Because when you’re third (let’s say Aurora is going to be DOE’s third exascale), it’s not very interesting, and it’s slower than your first two. That’s not cool.”

Questions about Intel

Intel’s missteps have called into question not only the company’s ability to execute but also the stability of the management team involved in HPC technologies. Departed executives include:

  • Alan Gara, who led the development of Intel’s highly touted high-performance OmniPath fabric and retired last year
  • Raj Hazra, former Corporate VP / GM, Enterprise & Government, Data Center Group, left Intel in November and is now with Micron
  • Charles Wuischpard, former Vice President of the Intel Datacenter Group and now CEO of Ayar Labs
  • Daniel McNamara, formerly Intel’s President / GM of the Network and Custom Logic Group and SVP of the Programmable (i.e. FPGA) Solutions Group, who is now with AMD
  • Diane Bryant left her role as group president of the Data Center Group more than three years ago

Speculation is spreading that further changes may be on the way.

“They seem to be dismantling, perhaps rebuilding, but dismantling what was their flagship team for developing trans-exascale machines,” said a leading HPC authority. “There are some lower-level people, and I don’t know if their departures have been announced or not. And I think a few more are up in the air.”

A milder view of Intel’s Aurora-related challenges comes from industry analyst firm Hyperion Research, whose senior adviser for HPC market dynamics, Steve Conway, told us in July that if Aurora ships in late 2021 or in 2022, it won’t be much of a delay, though still a delay. He also downplayed Intel’s difficulty in fixing the 7nm process problems that caused the Ponte Vecchio delay.

Conway’s colleague Bob Sorensen, senior VP of research at Hyperion, argued that some delays are to be expected when building top-of-the-line supercomputers. While he agreed that “it’s pretty clear that no one at Intel wants to be the guy to write the press release that says, guess what, we’re showing some delays,” he also said, “we’re at a place in semiconductor manufacturing where this is no longer just engineering, where you turn the crank to get to the next node.”

“New technology is required at every step now, so it’s difficult, really difficult,” said Sorensen. “I’m not defending Intel, and I’m not saying that nobody messed anything up here. But for me these are pretty aggressive systems, new architectures. There should be some allowance for slips in the schedule. If every machine came in like clockwork, exactly as planned, you would have to ask yourself the question: Are we advancing the state of the art sufficiently here? Or have we become too risk averse in our new architectures? … That is the nature of the beast when you push the envelope.”

However, Conway and Sorensen’s view of Intel appears to be a minority one. A source told us that a senior DOE official involved with exascale told colleagues he had “never been this angry.” This is partly due to the second chance Intel was given after the failure of the pre-exascale Aurora. But here, too, there is an opposing view: that Intel was awarded Aurora A21 because AMD, in partnership with Cray, had won the other two of DOE’s first three exascale contracts.

“DOE has never liked putting all of its eggs in one basket,” said Freund of Moor Insights, “just like when they switched back and forth between IBM and Intel and then Nvidia. I think some of that was at play here, where they had already foreseen that AMD would get the other exascales even though Argonne had already been given to Intel. They said we can’t put all of our eggs in the AMD basket, that’s too risky as well. And it’s not good for US industry to stifle competition like this. I think they probably had both practical and altruistic goals in placing (A21) with Intel.”

Indigenous technology

Another advantage for Intel: the growing value that leading technology countries place on “indigenous technology.” While Intel manufactures its chips domestically, AMD outsources chip manufacturing to TSMC in Taiwan. However, that advantage was undermined last month by Intel’s disclosure that part of its 7nm chip production could be outsourced to a third-party semiconductor foundry – presumably TSMC (Samsung is another option), whose 7nm CPUs and GPUs for AMD have given that company the price/performance to take HPC and data center processor market share from Intel.

But in a final twist, TSMC announced in May that it would build a fab in Arizona, which could have an impact on the growing rivalry between the US and China.

“There are concerns about the lack of indigenous, competitive advanced node processes,” Freund said. “I mean, Intel can’t. Samsung can, at its fab in Austin, and now that TSMC is building a facility in Arizona, TSMC could have domestic manufacturing capability as well. Options do still exist.”

“But there are real concerns as US-China relations continue to deteriorate,” Freund said. “Should China make a significant mistake and try to isolate Taiwan, which would require major military action, the US is in danger. We are absolutely exposed at the moment. And Chinese leaders are smart guys. They know that. So if we push hard enough, they might just push back.”

