Funny story about Bronson Messer (quoted in the article):
On my first trip to Oak Ridge we went on a tour of “The Machine”. Afterwards we were hanging out on the observation deck and got introduced to something like 10 people.
Everyone at Oak Ridge is just Tom, Bob, etc. No titles or any of that stuff - I’m not sure I’ve ever heard anyone refer to themselves or anyone else as “Doctor”.
Anyway, the guy to my right asks me a question about ML frameworks or something (don’t even remember it specifically). Then he says “Sorry, I’m sure that seems like a really basic question, I’m still learning this stuff. I’m a nuclear astrophysicist by training”.
Then someone yells out “AND a three-time Jeopardy champion”! Everyone laughs.
You guessed it, guy was Bronson.
Place is wild.
Now I get to tell him this story next time I see him :).
Reminds me of the t-shirt I had that said, "Ok, Ok, so you've got a PhD. Just don't touch anything."
It's just not in the culture, assuming you mostly work among other PhDs.
There were quite a few people who insisted on using Doctor when referring to others, calling themselves Doctor, etc.
He never did but I experienced it quite a bit.
Asking for a friend. The friend is me. I desperately want that shirt (assuming it's well designed).
It will complement my "Rage Against The Machine Learning" shirt that I'm wearing right now.
It's much easier and cheaper than in the past to just go to AliExpress, find a shop with good feedback and cotton shirts, upload a graphic (even if it's just text in a specific font and size), and wait a month. I usually pay around $10 each.
You typically write a bash script with some metadata lines at the top that say how many nodes you want, how many cores on those nodes, and what accelerator hardware, if any, you need.
Then typically it's just setting up the environment to run your software. On most supercomputers you need to use environment modules ('module load gcc@10.4') to load up compilers, parallelism libraries, other software, etc. You can sometimes set this stuff up on the login node to try things out and make sure they work, but generally you'll get an angry email if you run processes for more than 10 minutes, because login nodes are a shared resource.
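To make that concrete, here's a minimal sketch of such a job script for a SLURM-based machine (the account, module names, and program/input file are placeholders; real ones vary by site):

    #!/bin/bash
    #SBATCH --job-name=my_sim          # name shown in the queue
    #SBATCH --nodes=4                  # how many nodes
    #SBATCH --ntasks-per-node=32       # MPI ranks per node
    #SBATCH --gpus-per-node=4          # accelerators, if the partition has them
    #SBATCH --time=02:00:00            # wall-clock limit
    #SBATCH --account=PROJ123          # allocation to charge (hypothetical)

    # set up the environment on the compute nodes
    module load gcc
    module load openmpi

    # launch the program across all allocated nodes
    srun ./my_simulation input.nml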
There's a tension here because it's often difficult to get this right, and people often want to do things like 'pip install <package>', but that can leave a lot of performance on the table because pre-compiled software usually targets lowest-common-denominator systems rather than high-end ones. On the other hand, cluster admins can't install and precompile every Python package ever. EasyBuild and Spack are package managers that aim to make this easier.
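As a sketch of the difference (the package, compiler version, and CPU target here are purely illustrative), Spack lets you build a package against the local toolchain and microarchitecture instead of pulling a generic pre-built wheel:

    spack install py-numpy %gcc@12.2.0 target=zen3   # compile for the host CPU
    spack load py-numpy                              # make it available in the environment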
Source: worked in HPC in physics and then worked at a University cluster supporting users doing exactly this sort of thing.
90% of my interactions are ssh'ing into a login node and running code with SLURM, then downloading the data.
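In shell terms that loop is roughly the following (hostname and paths are placeholders):

    ssh user@cluster-login.example.org
    sbatch run_job.sh                      # submit the batch script
    squeue -u $USER                        # watch it move through the queue
    # ...later, back on the workstation:
    scp user@cluster-login.example.org:results/output.h5 .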
You typically develop programs with MPI/OpenMP to exploit multiple nodes and CPUs. In Fortran, this entails a few directives and compiler flags.
This will have much of what you need.
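To make the MPI/OpenMP point above concrete, here is a rough sketch of the build-and-launch side, assuming a Fortran code that already contains OpenMP directives (file and binary names are made up):

    # the MPI wrapper compiler handles include paths and linking;
    # -fopenmp turns on the OpenMP directives (gfortran flag shown)
    mpif90 -O3 -fopenmp solver.f90 -o solver

    # 8 nodes, 4 MPI ranks per node, 8 OpenMP threads per rank
    export OMP_NUM_THREADS=8
    srun --nodes=8 --ntasks-per-node=4 --cpus-per-task=8 ./solver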
It's like Kubernetes, but invented long before Kubernetes existed.
Is it really realistic to assume that this is the "fastest supercomputer"? What are estimated sizes for supercomputers used by OpenAI, Microsoft, Google etc?
Strangely enough, the Nature piece only mentions possible secret military supercomputers, but not ones used by AI companies.
AI models are trained using one (or a combination) of data parallelism, tensor parallelism, and pipeline parallelism. These all have fairly regular access patterns, and want bandwidth.
Traditional supercomputer workloads (typically MPI or SHMEM) are often far more variable in access pattern, and synchronization is often incredibly carefully optimized. Bandwidth is still hugely important here, but insane network switches and topologies tend to be the real secret sauce.
More and more these machines are built using commodity hardware (instead of stuff like Knights Landing from Intel), but the switches and network topology are still often pretty bespoke. This is required for really fine-tuned algorithms like distributed LU factorization, or matrix multiplication algorithms like COSMOS. The hyperscalers, by contrast, tend to want insane amounts of commodity hardware, including the network switches.
The AI supercomputers you're citing are getting a lot closer, but they are definitely more disaggregated than DoE lab machines by nature of the software they run.
I think the main difference is in the target numerical precision: supercomputers such as this one focus on maximizing FP64 throughput, while GPU clusters used by OpenAI or xAI want to compute in 16 or even 8 bit precision (BF16 or FP8).
Google-style infrastructure uses aggregation trees. This works well for fan-out, fan-back-in communication patterns, but has limited bisection bandwidth at the core/top of the tree. This can be mitigated with Clos networks / fat trees, but in practice no one goes for full bisection bandwidth on these systems as the cost and complexity aren't justified.
HPC machines typically use torus topology variants. This allows 2D and 3D grid-style computations to be mapped directly onto the system with nearly full bisection bandwidth. Each smallest grid element can communicate directly with its neighbors each iteration, without going over intermediate switches.
Reliability is handled quite a bit differently too. Google-style infrastructure does this with elaborations of the MapReduce style: spot the stragglers or failures and reallocate that work in software. HPC infrastructure puts more emphasis on hardware reliability.
You're right that FP32 and FP64 performance are more important in HPC, while Google apps are mostly integer-only, and ML apps can use lower-precision formats like FP16.
You're spot on that the bandwidth available in these machines hugely outstrips that of common cloud rack-scale cluster designs, although full bisection bandwidth hasn't been a design goal for larger systems for a number of years.
Bisection bandwidth is the metric these systems will cite, and it impacts how the largest simulations will behave. Inter-node bandwidth isn't a direct comparison, and can be higher at modest node counts as long as you're within a single switch. I haven't seen a network diagram for Lambda Labs, but it looks like they're building off 200 Gbps InfiniBand once you get outside of NVLink. So they'll have higher bandwidth within each NVLink island, but the performance will drop once you need to cross islands.
Oh wow, that’s pretty bad.
Within the node, bandwidth between the GPUs is considerably higher. There's an architecture diagram at <https://docs.olcf.ornl.gov/systems/frontier_user_guide.html> that helps show the topology.
It’s the fastest publicly disclosed. As far as private concerns, I feel like a “prove it” approach is valid.
ORNL and its partners continue to execute the bring-up of Frontier on schedule. Next steps include continued testing and validation of the system, which remains on track for final acceptance and early science access later in 2022 and open for full science at the beginning of 2023.
[0] https://www.ornl.gov/news/frontier-supercomputer-debuts-worl...
Two things I’ve always wondered since I’m not an expert.
1. Obviously, applications must be written to distribute their load effectively across the supercomputer. I wonder how often this prevents useful things from being considered for running on the supercomputer.
2. It always seems like getting access to run anything on the supercomputer is very competitive or even artificially limited? A shame this isn't open to more people. That much processing power seems like it should go much further and be put to use for more things.
One of the main differences between supercomputers and, e.g., a datacenter is that in the former case, application authors do not, as a rule, assume hardware or network failures and engineer around them. A typical supercomputer workload will fail overall if any one of its hundreds or thousands of workers fails. This assumption greatly simplifies the work of writing such software, as error handling is typically one of the biggest, if not the biggest, sources of complexity in a distributed system. It makes engineering the hardware much harder, of course, but that's how HPE makes money.
A second difference is that RDMA (Remote Direct Memory Access: the ability for one computer to access another computer's memory without going through its CPU, because the network card can access memory directly) is standard. This removes all the complexity of an RPC framework from supercomputer workloads. Also, the underlying link protocol has orders of magnitude lower latency than Ethernet, such that it's often faster to read memory on a remote machine than to do any kind of local caching.
The result is that the frameworks for writing these workloads let you more or less call an arbitrary function, run it on a neighbor, and collect the result in roughly the same amount of time it would’ve taken to run it locally.
HPC applications were a driving force behind software checkpointing. If a job runs for days, it's not all that unlikely that one of hundreds of machines will fail. At the same time, re-running a large job is fairly costly on such a system.
Now, while checkpointing exists, I don't know how often it's actually used. In my own, very limited, experience it wasn't, and job failures due to hardware failure were rare. But then, the cluster(s) I tended to were much smaller, up to some 100 nodes each.
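Where checkpointing is used, the usual pattern is a job script that restarts from the newest checkpoint if one exists. A hedged sketch (the application, its flags, and the file layout are all invented for illustration):

    #!/bin/bash
    #SBATCH --nodes=64
    #SBATCH --time=24:00:00

    # resume from the most recent checkpoint, if any (naming scheme is hypothetical)
    latest=$(ls -t checkpoints/state_*.chk 2>/dev/null | head -n 1)
    if [ -n "$latest" ]; then
        srun ./my_app --restart "$latest"
    else
        srun ./my_app --config input.cfg
    fi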
Here in Finland I think you can use the LUMI supercomputer for free, on the condition that the results are made publicly available.
I'm surprised that Frontier is free with the same conditions; I expected researchers to need grant money or whatever to fund their time. Neat.
At some point, in order to concurrently allocate a 1000-node job, all 1000 nodes will need to be briefly unoccupied ahead of that, and that can introduce some unavoidable gaps in system usage. Tuning in the "backfill" scheduling part of the workload manager can help reduce that, and a healthy mix of smaller single-node short-duration work alongside bigger multi-day multi-thousand-node jobs helps keep the machine busy.
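For example, a small job with an honest, short time limit is exactly what the backfill scheduler can slot into one of those gaps while it drains nodes for a big reservation (script name is a placeholder):

    sbatch --nodes=1 --time=00:30:00 analysis.sh   # short enough to fit a backfill window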
As others note, these systems are measured by running a specific benchmark - Linpack - and require results to be formally submitted. There are systems in China that are on a similar scale but, for political reasons, have not formally submitted results. There are also always rumors about the scale of classified systems owned by various countries that are likewise not publicized.
Alongside that, the hyperscale cloud industry has added some wrinkles to how these are tracked and managed. Microsoft occupies the third position with "Eagle", which I believe is one of their newer datacenter deployments briefly repurposed to run Linpack. And they're rolling similar scale systems out on a frequent basis.
"Cloud" doesn't mean much more than "computer connected to the Internet".
HPC workloads are often focused on highly parallel jobs, with high-speed and (especially) low-latency communications between nodes. Fun fact: In the NVIDIA DGX SuperPOD Reference Architecture[1], each DGX H100 system (which has eight H100 GPUs per system) has four InfiniBand NDR OSFP ports dedicated to GPU traffic. IIRC, each OSFP port operates at 200 Gbps (two lanes of 100 Gbps), allowing each GPU to effectively have its own IB port for GPU-to-GPU traffic.
(NVIDIA's not the only group doing that, BTW: Stanford's Sherlock 4.0 HPC environment[2], in their GPU-heavy servers, also uses multiple NDR ports per system.)
Solutions like that are not something you'll find at a typical cloud provider.
Early cloud-based HPC-focused solutions centered on workload locality, not just within a particular zone but within a particular part of a zone, with things like AWS Placement Groups[3]. More-modern Ethernet-based providers will give you guides like [4], telling you how to supplement placement groups with directly-accessible high-bandwidth network adapters, and in particular support for RDMA or RoCE (RDMA over Converged Ethernet), which aims to provide IB-like functionality over Ethernet.
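For instance, the placement-group part is a one-liner with the AWS CLI (the group name here is arbitrary), and instances launched into it get packed onto nearby hardware:

    aws ec2 create-placement-group --group-name hpc-demo --strategy cluster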
IMO, the closest analog you'll find in the cloud, to environments like Frontier, is going to be IB-based cloud environments from Azure HPC ('general' cloud) [5] and specialty-cloud folks like Lambda Labs [6].
[1]: https://docs.nvidia.com/dgx-superpod/reference-architecture-...
[2]: https://news.sherlock.stanford.edu/publications/sherlock-4-0...
[3]: https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/placemen...
[4]: https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/efa.html
[5]: https://azure.microsoft.com/en-us/solutions/high-performance...