In 2024 I first attended the EOFS workshop about open-source parallel filesystems. The EOFS organizes this meeting once a year together with different partners from German universities to gather people and talk about the current status of open, as in non-proprietary, filesystems. I found this mini conference quite interesting and inspiring, and it was one of the reasons I ended up working in HPC.

So I’m happy to have the opportunity to participate again this year in Mainz, and I will summarize the things I find interesting in this blog post.

Day 1 (27. Feb. 2025)

Ceph as HPC storage

One participant explained that they are going to use Ceph for the /home/ of their users. This is particularly interesting because Ceph is a true linear scale-out system on commodity hardware, has a really large community, and is extremely flexible. In their benchmarks they measured ~20 GiB/s with a thousand spinning disks and expect a theoretical maximum of ~55 GiB/s. Since they expect more workloads coming from cloud environments, they also expect more users to ask for S3. They will therefore start to put some of their medium I/O-intensive workloads on the NVMe-based part of the Ceph cluster as well.
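S3 on Ceph is served by the RADOS Gateway, so from a user’s point of view it looks like any other S3 endpoint. As a rough illustration (not their setup; endpoint URL, credentials and bucket names are made up), a minimal boto3 sketch could look like this:

```python
import boto3

# Hypothetical RADOS Gateway endpoint and credentials; substitute your site's values.
s3 = boto3.client(
    "s3",
    endpoint_url="https://rgw.example-hpc.de",
    aws_access_key_id="ACCESS_KEY",
    aws_secret_access_key="SECRET_KEY",
)

# Upload a result file into a user bucket and list what is stored there.
s3.upload_file("results.h5", "my-bucket", "run-42/results.h5")
for obj in s3.list_objects_v2(Bucket="my-bucket").get("Contents", []):
    print(obj["Key"], obj["Size"])
```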

Lustre still standing

Lustre seems to be the main champion of the HPC parallel filesystems. Although a lot of community members complain about the bad documentation and unclear configuration, it still seems to provide the best performance, especially if one knows what I/O pattern to expect and configures the striping accordingly.
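To give an idea of what “configuring the striping accordingly” means in practice: the stripe layout of a directory can be set with the standard lfs tool. The sketch below simply wraps the well-known lfs setstripe/getstripe commands; the directory path is hypothetical and the parameters are just one example for large sequential writes:

```python
import subprocess

# Hypothetical project directory on a Lustre mount; adjust to your system.
target = "/lustre/project/large_checkpoints"

# Stripe across all available OSTs (-c -1) with a 4 MiB stripe size,
# which tends to suit large sequential writes.
subprocess.run(["lfs", "setstripe", "-c", "-1", "-S", "4M", target], check=True)

# Show the layout that new files in this directory will inherit.
subprocess.run(["lfs", "getstripe", "-d", target], check=True)
```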
A fancy new feature on Lustre’s roadmap is the Metadata Writeback Cache. With it, a client running a workload with many file creations can write to a local in-memory filesystem, and the writeback to the Lustre storage servers happens when someone else tries to access that tree (or after some time).
Multiple other admins complained that they have had Lustre in use in their HPC clusters for decades and somehow try to use it as a one-filesystem-fits-all solution with just a tape library for archiving. However, different clusters and user groups definitely have different demands, and you cannot tune your filesystem to serve all of them best. In the end, maintaining and, even more, upgrading those clusters gets closer to impossible every year. I myself always felt kind of lost when working with Lustre, and I don’t think the way it is administrated is intuitive or modern in any way. I also don’t think that adding more and more features (some very complex) will help with that. If there isn’t real pressure to use Lustre for performance or compatibility reasons, I would always try to pivot to something else.

Daos still rising

Apparently Daos withstands the discontinuation of Intel’s Optane persistent memory pretty well! The whole development team was just moved from Intel to HPE, and they seem to be quite committed to keeping development going. The alternative mode, where the persistent memory is replaced with a combination of DRAM and NVMe drives, also seems to work out as hoped. Daos is to me the most exciting project in this area since it’s written specifically for modern storage technologies with the goal of providing I/O-API-agnostic datastores. It’s also very well documented and has intuitive tooling, making it easy to deploy and use.

BeeGFS

Somewhere in between Lustre and Daos there is BeeGFS. Although I have never used it, it has some interesting features like the on-demand (BeeOND) mode, where the distributed filesystem is created dynamically on all nodes that run a job. They also implement remote storage targets, where the filesystem can asynchronously write back to a remote S3 endpoint while the machine can still do I/O to the LAN-local BeeGFS. The information about which remote target a file should go to is stored in the metadata, meaning it is an intrinsic part of a file in BeeGFS. There is also ongoing work on their generic POSIX-to-POSIX copy tool, which can do large file transfers in parallel. This tool is not only capable of running multithreaded but can also copy data using multiple nodes and, of course, resume aborted transfers. I like that they build this tooling not just for their own filesystem but as a general, easy-to-use solution replacing e.g. MPI-based copy mechanisms or plain rsync.
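Their tool itself wasn’t shown in detail, but the core idea of a parallel POSIX-to-POSIX copy is easy to sketch. The following is my own minimal single-node illustration (paths are made up), not their tool; skipping files that already exist with the same size gives a crude form of resumability:

```python
import shutil
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

SRC = Path("/mnt/scratch/dataset")   # hypothetical source tree
DST = Path("/mnt/beegfs/dataset")    # hypothetical destination tree

def copy_one(src_file: Path) -> None:
    """Copy a single file, preserving its relative path and skipping files
    that already exist at the destination with the same size."""
    dst_file = DST / src_file.relative_to(SRC)
    dst_file.parent.mkdir(parents=True, exist_ok=True)
    if dst_file.exists() and dst_file.stat().st_size == src_file.stat().st_size:
        return
    shutil.copy2(src_file, dst_file)

files = [p for p in SRC.rglob("*") if p.is_file()]
with ThreadPoolExecutor(max_workers=16) as pool:
    list(pool.map(copy_one, files))
```

A real tool like theirs presumably also splits single huge files into ranges and spreads the work across multiple nodes, which is where most of the speedup for very large transfers comes from.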

Filesharing with Globus

Another colleague talked about the pain of copying multiple terabytes around the globe. They recently joined the Globus service, which is basically a file browser application hosted by the University of Chicago. It allows users to create Globus Connect servers in their own network which have access to a large storage cluster, and then link their SSO solution for user login. Globus then manages federated login and authorization on published datasets and, if wanted by the user, initiates peer-to-peer file transfers. The idea is great, but they just turned it into a closed-source, paid product with lots of plugin subscriptions. I would rather like to see a federated approach here with software maintained by the community.
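For scripting such transfers, Globus provides a Python SDK (globus-sdk). A rough sketch of submitting a transfer between two endpoints could look like the following; the client ID, endpoint UUIDs and paths are placeholders, not anything from the talk:

```python
import globus_sdk

CLIENT_ID = "YOUR-NATIVE-APP-CLIENT-ID"      # placeholder
SRC_ENDPOINT = "SOURCE-ENDPOINT-UUID"        # placeholder
DST_ENDPOINT = "DESTINATION-ENDPOINT-UUID"   # placeholder

# Interactive login against Globus Auth (the federated SSO part).
auth = globus_sdk.NativeAppAuthClient(CLIENT_ID)
auth.oauth2_start_flow()
print("Please log in at:", auth.oauth2_get_authorize_url())
tokens = auth.oauth2_exchange_code_for_tokens(input("Auth code: "))
transfer_token = tokens.by_resource_server["transfer.api.globus.org"]["access_token"]

# Submit a transfer task; Globus then drives the endpoint-to-endpoint copy.
tc = globus_sdk.TransferClient(
    authorizer=globus_sdk.AccessTokenAuthorizer(transfer_token)
)
task = globus_sdk.TransferData(tc, SRC_ENDPOINT, DST_ENDPOINT, label="multi-TB dataset")
task.add_item("/project/dataset/", "/archive/dataset/", recursive=True)
print("Task ID:", tc.submit_transfer(task)["task_id"])
```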

Overall, the first day was more than exciting for me. The topic overlap between development and operations is just right for me, and the people at this event are very friendly and approachable. It turned out that in HPC contexts POSIX-style filesystems are still used in most cases, which is, for example, no longer true for software meant to run as cloud services. Also, the attempt to build insanely large filesystems and do everything at once on them seems to finally fall apart. We can see a lot of people investing in splitting clusters into specialized smaller ones or at least introducing multiple (more than 3) storage tiers. However, tiering means copying, and here the world still lacks a common denominator. In 2025 it is still an issue to copy large files from one system to another %)

Day 2 (28. Feb. 2025)

SmartNICs

One of the most interesting talks today was about using SmartNICs together with distributed filesystems. In this case people used an Nvidia BlueField-3 to seamlessly encrypt or compress data written to a Lustre filesystem. The data is mangled by one of the card’s RISC-V CPUs while it sits in the network card’s buffer. They also argue this increases security because the code cannot be accessed by the host system. Although this is super interesting and crazy bleeding-edge technology, it is also very clear that the use case is artificial and the usage is immensely complicated, especially if you don’t want to introduce latencies. However, another colleague told us that VAST uses these kinds of cards to build storage controllers that consist of only a PCIe switch plus backplane and a set of BlueField cards. This all sounds very exciting but, to be honest, it also makes the computer more complicated and shifts the same problems to another part of your machine.

Misc topics

Some of the notes I took about topics which aren’t exactly in my area of interest:

  • The Fraunhofer Institute develops a distributed memory+storage management system called GPI
    • It’s an alternative to MPI
    • They let the application use batches, which enables AI workloads to do massive random access without bottlenecking on a metadata server
  • People are researching how to get AI workloads into the h5bench benchmark
  • JULEA is a framework for creating distributed filesystems and has been in development for ~15 years already
  • EU rules force the public sector to base solicitations on existing technology, which means the technology level in use always lags multiple years behind

Explainable I/O

One research group is working on visualizing I/O from the submission at application level down to the actual physical write. The main idea is to add traceability to I/O requests so that they can be observed through the whole stack. It is currently being implemented on top of eBPF and SystemTap and is still heavily under development. They are still trying to get it to work properly for local I/O first, but they say they managed to do that with an overhead of just 1.3%, which would be alright for debugging.
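Their implementation wasn’t shown in code, but to give an idea of what eBPF-based I/O tracing looks like in general, here is an unrelated minimal sketch using the BCC Python frontend. It attaches a kprobe to vfs_write and prints a trace line for every call (requires root and an installed bcc); their actual work of course goes much further by correlating requests across the layers:

```python
from bcc import BPF

# Minimal eBPF program: attach a kprobe to vfs_write and emit a trace line.
prog = """
int trace_vfs_write(void *ctx) {
    bpf_trace_printk("vfs_write\\n");
    return 0;
}
"""

b = BPF(text=prog)
b.attach_kprobe(event="vfs_write", fn_name="trace_vfs_write")
print("Tracing vfs_write... Ctrl-C to stop")
b.trace_print()  # each line already includes the calling process and a timestamp
```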

We also had discussions about the topics; on the second day these were mostly focused on the state of HPC in German education. Many attendees from the universities explained how difficult it is these days to find students who are interested in storage, filesystems or even operating systems. The same goes for any kind of research proposal: without something about machine learning or at least energy saving in the title there is almost no interest, which means no funding.

I guess it will take a while until the AI bubble bursts, and until then we’re going to see a lot of horrible I/O patterns. To me this is actually not such a bad prospect. A change of conditions usually triggers invention and opens alternative paths to well-established solutions.