Since I currently work on maintaining a smaller HPC cluster, I had the chance to attend the ISC conference for the first time. Here is a summary of my notes from the conference and exhibition.

The event overall

Overall, I found the exhibition booths to be the most interesting part of the event. They provided a great opportunity to connect effortlessly with various manufacturers and colleagues. The number of sales reps was pleasantly low (apparently, most of them were attending a conference in Paris), allowing me to engage directly in technical discussions almost everywhere.

The presentation program, on the other hand, was rather disappointing to me. Much of the HPC research is simply too advanced for me, and the talks and panels at my level often lacked quality.

However, the meetings between our organization and the manufacturers were genuinely interesting. The atmosphere was refreshingly friendly, especially once my boss announced that he plans to spend a few million on a new computing cluster! The manufacturers took the time to address questions at every level, which made the discussions very insightful. This format was new to me, and I gained a lot of valuable knowledge from it.

Cornelis Omnipath

I have been quite skeptical about Intel OPA so far, but I must say the folks at Cornelis really won me over. At least from my perspective, they were technically very competent and presented exciting concepts. According to their benchmarks, they are, of course, better than InfiniBand when it comes to low latency and especially high throughput for small messages. I also found their plans for the next generation truly fascinating. Their upcoming ASICs are planned to support UltraEthernet by default, with the actual Layer 2 protocol then negotiated via LLDP. This means that on a single physical card, for example, one VLAN can communicate via OPA while another uses Ethernet. Looking at the roadmaps of the various manufacturers, it is clear that everyone is aiming for UltraEthernet support. So we can stay curious about what will happen over the next few years once it finally becomes available.

Information about current OPA products: Cornelis Networks
Information about UltraEthernet features: UltraEthernet

IBM Storage Scale

As is customary for part-time gods, signing an NDA was required here. In short, IBM sticks with its all-in-one approach and continues to offer all the features the others have, while also ensuring compatibility with all other formats and protocols. All of this is part of the GPFS Spectrum Storage Scale System product catalog. %) One point they mentioned that I found interesting: apparently, more and more of their HPC customers are using applications that want S3 as a storage protocol.
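
To illustrate what "applications that want S3" looks like in practice, here is a minimal sketch: instead of reading and writing files through a POSIX mount, the application talks to an S3-compatible endpoint. The endpoint URL, bucket name, and credentials below are placeholders of my own, not anything IBM showed, and any S3-compatible service could sit behind them.

    # Minimal sketch: an application using S3 as its storage protocol instead of POSIX I/O.
    # Endpoint, bucket, and credentials are hypothetical placeholders.
    import boto3

    s3 = boto3.client(
        "s3",
        endpoint_url="https://s3.hpc.example.org",   # hypothetical S3 endpoint of the storage system
        aws_access_key_id="ACCESS_KEY",              # placeholder credentials
        aws_secret_access_key="SECRET_KEY",
    )

    # Write a result object and read it back: no mount point, no POSIX semantics involved.
    s3.put_object(Bucket="simulation-results", Key="run-042/output.dat", Body=b"...")
    obj = s3.get_object(Bucket="simulation-results", Key="run-042/output.dat")
    print(len(obj["Body"].read()), "bytes read back via S3")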

Ceph

There were also a few exhibitors with a Ceph focus.

  • https://www.croit.io/ from Munich offers an all-inclusive software solution for installing PXE-booting Ceph and DAOS nodes, with a web GUI and all the bells and whistles. Of course, they also provide Ceph support and consulting.

  • https://www.clyso.com/eu/de/, who also provide support for GWDG, were present as well, though they dodged the question of where Ceph has already been successfully used in an HPC context. Instead, they offered to drop by to do some consulting ;-)

Colleagues who are already running Ceph in an HPC context emphasized: you can run some impressive benchmarks with Ceph and scale out quite broadly, but a cluster configured this way, with a high number of metadata servers, is very demanding and cumbersome in everyday operation. That's why they have significantly scaled back for daily use. They are still very pleased with the Clyso support, but much of the collaboration involves finding and fixing bugs that emerge due to their unusual cluster setup.

Slurm & Slinky

SchedMD gave a big presentation on their new Slurm version, highlighting the REST API and the Kubernetes integration (Slinky). Jobs can now be submitted via K8s Deployment manifests and then run on traditional Slurm compute nodes, or the other way around: submitted via sbatch and then run on K8s workers. Or both can be mixed. To make this work, they replaced a large portion of the Kubernetes scheduler with Slurm. The interesting point here is that the folks at SchedMD report more and more users who have never launched HPC/AI workloads from a traditional login node and only know the cloud-native methods. Another notable takeaway: they found that Kubernetes' current resource management, especially around hardware pinning, is quite inadequate for HPC demands, so some development work will likely flow back into the Kubernetes community here.
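
Since the REST API was one of the highlights, here is a minimal sketch of what a submission through slurmrestd can look like. The URL, partition name, and API version segment are placeholders of my own, and the exact JSON field names vary between Slurm releases (check the OpenAPI spec your slurmrestd exposes), so treat this as an outline rather than a recipe.

    # Minimal sketch: submitting a batch job via the Slurm REST API (slurmrestd).
    # Assumptions: slurmrestd is reachable at SLURMRESTD_URL, JWT auth is enabled
    # (token obtained e.g. via `scontrol token`), and the version segment matches
    # the local Slurm release; field names differ between API versions.
    import os
    import requests

    SLURMRESTD_URL = "http://slurmrestd.example.org:6820"   # hypothetical endpoint
    API_VERSION = "v0.0.40"                                  # depends on the Slurm release

    headers = {
        "X-SLURM-USER-NAME": os.environ["USER"],
        "X-SLURM-USER-TOKEN": os.environ["SLURM_JWT"],       # JWT from `scontrol token`
        "Content-Type": "application/json",
    }

    payload = {
        # The batch script is sent as a plain string, just like an sbatch file.
        "script": "#!/bin/bash\nsrun hostname\n",
        "job": {
            "name": "rest-api-test",
            "partition": "standard",                         # hypothetical partition name
            "nodes": 1,
            "current_working_directory": "/tmp",
            "environment": ["PATH=/usr/bin:/bin"],           # an explicit environment is required
        },
    }

    resp = requests.post(
        f"{SLURMRESTD_URL}/slurm/{API_VERSION}/job/submit",
        headers=headers,
        json=payload,
        timeout=10,
    )
    resp.raise_for_status()
    print("Submitted job:", resp.json().get("job_id"))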

Slurm 25.05: slurm.schedmd.com/release_notes.html

Quobyte

A former colleague of mine works there now, so I spent quite a bit of time with them, and the team was very pleasant and friendly. They do seem to be making progress, which surprised me considering what I saw of their product a few years ago. They now appear to have several customers with HPC-like requirements, but I would still be worried that their product ends up combining the downsides of Ceph and Scale.

Conclusion

Hardware continues to get faster as usual, cloud technologies are making deeper inroads into HPC, and everyone (even those who just sell hard drives) is an AI specialist now, naturally 🙂 Still, I thought the event was great, and I would attend again.