| tags:hpc linux filesystem conference opensource storage categories:Misc
EOFS Open Filesystem Workshop 2026
This year I also attended the EOFS Workshop on Open Source Parallel Filesystems. I already wrote about the last one here on my blog, and it has always been a pleasure to be there. This year the EOFS organized the workshop in Paris, the first time outside of Germany.
In this blog post I'm going to summarize the parts I found most interesting:
GekkoFS
GekkoFS is a fast, on-demand distributed filesystem. It uses local storage, RocksDB, and Mercury RPC to span an ephemeral filesystem, e.g. over all compute nodes participating in a job. The promise for users is a shared filesystem with predictable performance, since there is no interference from other jobs as there would be on central storage. Of course that means all data needed for the job must be staged in before the work can begin and written back to persistent storage when the job ends. This is an extra step users have to take themselves, because GekkoFS cannot automatically stage in files when they are accessed.
GekkoFS has, in my opinion, a particularly interesting architecture. The program runs completely in userspace and does not even require a mount() call, so no root privileges are ever required. To make that possible it uses LD_PRELOAD to intercept I/O-related syscalls and checks whether they refer to a GekkoFS path. If so, the call is handled by the GekkoFS binaries; otherwise it is passed on to the kernel as usual.
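The interception trick can be sketched in a few lines of C. This is purely illustrative, not the real GekkoFS code; the `/gkfs/` prefix and the behaviour inside the branch are my assumptions. Compiled into a shared object and injected via LD_PRELOAD, this `open()` shadows libc's: GekkoFS paths would be handled by the client library, everything else is forwarded to the next `open()` in the symbol lookup chain.

```c
/* Minimal sketch of GekkoFS-style syscall interception (illustrative
 * only, not GekkoFS code). Build: gcc -shared -fPIC -o libgkfs.so ...
 * then run an app with LD_PRELOAD=./libgkfs.so ./app */
#define _GNU_SOURCE
#include <dlfcn.h>
#include <fcntl.h>
#include <stdarg.h>
#include <string.h>
#include <sys/types.h>

#define GKFS_PREFIX "/gkfs/"  /* assumed pseudo-mount point */

/* Does the path belong to the ephemeral filesystem? */
static int is_gkfs_path(const char *path) {
    return strncmp(path, GKFS_PREFIX, strlen(GKFS_PREFIX)) == 0;
}

int open(const char *path, int flags, ...) {
    mode_t mode = 0;
    if (flags & O_CREAT) {  /* the mode argument only exists with O_CREAT */
        va_list ap;
        va_start(ap, flags);
        mode = (mode_t)va_arg(ap, int);
        va_end(ap);
    }
    if (is_gkfs_path(path)) {
        /* The real client would route this over Mercury RPC to the
         * node owning the file; this sketch just rejects the call. */
        return -1;
    }
    /* Forward to the next open() in the chain (normally libc's). */
    int (*real_open)(const char *, int, ...) =
        (int (*)(const char *, int, ...))dlsym(RTLD_NEXT, "open");
    return real_open(path, flags, mode);
}
```

The sketch also shows why the design is fragile: any second preloaded library that defines `open()` silently takes over the front of the lookup chain.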
However, this design decision also has some major caveats. First of all, if anything else in the compute job also relies on LD_PRELOAD and overwrites it, the filesystem breaks, resulting in a failed job.
An admin using it also reported that building and installing GekkoFS is quite cumbersome. Parts of the build system seem stale and unmaintained, and bootstrapping new filesystems did not always succeed. GekkoFS does not support multithreaded, multi-node file copying via mpiFileUtils, although it brings its own tool to do that.
All in all, it looks like GekkoFS is not really robust against inexperienced users, which might be due to its still small community. Anyway, I was happy to see that the project is moving forward and people are actually using it.
BeeGFS
BeeGFS is, in my eyes, kind of the cool kid on the block. They implement features that speak of a long-term vision, one I can see a need for in the future.
Unlike GekkoFS, BeeGFS clients need a kernel module to mount the filesystem, although the servers reside completely in userspace. That being said, BeeGFS has an ephemeral mode called BeeOND where multiple hosts can spin up a BeeGFS with a single command.
It also integrates nicely with the Linux kernel page cache, updating cached files when they change.
One of the more futuristic features is the Remote Storage Targets. A BeeGFS can pull files from S3 endpoints, store them locally for modification, and push them back to the remote target when they are no longer needed. This can happen asynchronously as a background job. With the new version it is also possible to stub those files, so that only the metadata is kept locally. Since this half-available state introduces a lot of decision problems about when to report which kind of error, there are different states that can be set:

* available
* auto restore
* manual restore
* delayed restore (e.g. for tape archives with ultra-long access times)
Depending on these states, the timeout and error behaviour differs, so that it stays predictable for users.
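As a toy illustration of how a client could map such stub states to timeout and error behaviour (the enum names and all numbers here are my invention, not the BeeGFS API):

```c
/* Hypothetical mapping of stub-file states to open() behaviour.
 * Names and timeouts are illustrative, not taken from BeeGFS. */
enum rst_state {
    RST_AVAILABLE,       /* data is local, serve immediately        */
    RST_AUTO_RESTORE,    /* fetch from remote target transparently  */
    RST_MANUAL_RESTORE,  /* fail until the user restores explicitly */
    RST_DELAYED_RESTORE  /* e.g. tape: queue and wait a long time   */
};

/* Seconds a client should wait for the data before giving up,
 * or -1 if the open should fail right away with an error. */
static int open_timeout_s(enum rst_state s) {
    switch (s) {
    case RST_AVAILABLE:       return 0;         /* no wait needed      */
    case RST_AUTO_RESTORE:    return 30;        /* short network fetch */
    case RST_DELAYED_RESTORE: return 24 * 3600; /* tape recall         */
    case RST_MANUAL_RESTORE:  /* fall through */
    default:                  return -1;        /* report an error     */
    }
}
```

The point of the per-state table is exactly what the paragraph above describes: the client knows in advance whether to block, queue, or error out, instead of guessing.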
Besides that, they also support IPv6 now (lol) and implemented a feature set called Entry Migration. This gives the option to move files between targets as a background job, which might be useful for hot-pool emptying, rebalancing, and all kinds of metadata-driven lifecycle management.
I find it exciting how the people behind BeeGFS dare to implement features beyond the scope of being just a distributed filesystem to create a truly generic, modern data storage solution.
DAOS
This is one of my favorites. DAOS is an NVMe-only, scale-out filesystem with a very modern design. Like GekkoFS and unlike Lustre, it has no centralised metadata servers, which makes it very easy to scale up and down. Since it was acquired by HPE, they are now building a product from it with the clean and simple name HPE Cray Supercomputing Storage Systems K3000.
Internally they are working on a lot of performance topics to make DAOS ready for larger clusters. The biggest deployment so far is at the ALCF with almost a thousand servers, and recent tests show that even 128 of them deliver a solid 5 TB/s over a Slingshot network.
Another concern is that DAOS uses a lot of RAM: with the old Intel Optane Persistent Memory, a lot of space was quite cheap compared to today's DRAM prices. So a lot of work also goes into reducing the amount of memory used.
The HDF5 code is probably stale, a team member from HPE said, as not many people are using it. So be careful when relying on it.
In the future DAOS will also be able to understand switch topologies for data placement decisions. This way, fault domains like servers or racks can be used to configure how DAOS does replication or erasure coding. A QoS feature is also coming to limit the impact of background activities.
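The fault-domain idea boils down to a simple constraint: replicas (or erasure-coded shards) must not share a failure unit. A tiny sketch of that check, where the rack-per-server array stands in for the switch topology the system would discover (all names here are mine, not DAOS code):

```c
/* Toy fault-domain check: accept a replica placement only if all
 * chosen servers sit in different racks. Illustrative sketch, not
 * DAOS code; rack[i] is the rack id of the i-th chosen server. */
#include <stddef.h>

static int spans_distinct_racks(const int *rack, size_t n) {
    for (size_t i = 0; i < n; i++)
        for (size_t j = i + 1; j < n; j++)
            if (rack[i] == rack[j])
                return 0;  /* two replicas share a fault domain */
    return 1;
}
```

A real placement engine would retry with other servers until the constraint holds, so losing one rack can never take out all copies of a chunk.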
From the DAOS community I learned that an S3 exporter is also available, although it is not yet considered mature for production. The community is also working on verbs support via OPA. This worked with older drivers but is not supported right now, and HPE of course has no interest in spending money on it, since it is a directly competing product.
Lustre
Development of Lustre is still going on. They are implementing a lot of modern software-defined storage features that we know from other filesystems. Highlights this year are the gradual implementation of erasure coding and advanced features for dynamic multi-tenancy support. The latter especially targets cloud providers using Lustre in fast-changing environments. In addition to the erasure coding, it is now also possible to set up failover for management servers. Another new feature is a trashcan from which users can restore deleted files. One of the more interesting things is the hybrid I/O feature, where Lustre automatically switches between aligned direct I/O for larger files and buffered I/O for smaller files. I wrote about this in my post from LAD 2023.
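The hybrid I/O decision can be sketched as a simple flag selector: use O_DIRECT for large, well-aligned transfers and plain buffered I/O otherwise. The 4 MiB threshold and 4 KiB alignment below are my assumptions for illustration, not the actual Lustre tunables.

```c
/* Sketch of a hybrid-I/O flag decision: bypass the page cache only
 * for large, aligned transfers. Threshold and alignment are assumed
 * values, not Lustre's real cutoffs. */
#define _GNU_SOURCE  /* O_DIRECT is a GNU extension on Linux */
#include <fcntl.h>
#include <stddef.h>

#define HYBRID_THRESHOLD (4u << 20)  /* assumed cutoff: 4 MiB     */
#define DIO_ALIGNMENT    4096u       /* assumed: 4 KiB alignment  */

static int choose_open_flags(size_t io_size, int base_flags) {
    if (io_size >= HYBRID_THRESHOLD && io_size % DIO_ALIGNMENT == 0)
        return base_flags | O_DIRECT;  /* large + aligned: direct I/O */
    return base_flags;                 /* small: buffered via page cache */
}
```

The appeal for users is that small, cache-friendly I/O and large streaming I/O both get the right path without the application having to choose.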
While I certainly honor all the work that people still put into Lustre, I also still feel that Lustre has a very old-fashioned architecture in most aspects, and they are now trying to retrofit features that are expected from more modern designs. To me it seems that Lustre has been outperformed and out-featured over the years, and I predict that it will be replaced by more convenient solutions.
Robin Hood
Robinhood has evolved a lot since I first mentioned it in the LAD23 blogpost. It’s a filesystem‑agnostic metadata catalog that works with different backends like POSIX, MongoDB, and S3. The newly released version 4 is a big improvement over version 3, offering much better performance and fixing several issues that made scaling difficult. The team is also working on rhb‑policy, which should remove the need for cron‑based scripts and make policy handling cleaner and more reliable.
lustre-db
Lustre-db solves the same problem as Robinhood. It is a project from the German DKRZ, where they have a 120 PiB Lustre cluster and neither the admins nor the users have an overview of where the active data lies.
Classic utilities like Linux du take days for the bigger projects (~200 inodes/sec), which renders them practically unusable.
So the lustre-db project gets the metadata directly from Lustre as a binary JSON blob and stores it on the Lustre data servers. Reading back these files enables the software to fetch the necessary data far faster than it could get it from the metadata servers. The drawback is, of course, that the data is up to 24 hours old.
Users can then use a lustre-du utility (up to 10 million inodes/sec) or a web interface that graphically shows the size distribution.
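To make the two scan rates concrete, here is the back-of-the-envelope math; the 100-million-inode project size is my assumption, chosen only to give the quoted rates a common reference point:

```c
/* Scan time at a given throughput. At 200 inodes/s a 100 M inode
 * project takes ~5.8 days; at 10 M inodes/s the same scan takes 10 s.
 * (100 M inodes is an assumed project size for illustration.) */
static double scan_seconds(double inodes, double inodes_per_sec) {
    return inodes / inodes_per_sec;
}
```

That five-orders-of-magnitude gap is why classic du is "practically unusable" here while lustre-du is interactive.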
To ensure that users can only see the metadata of their own files, lustre-db stores the JSON blobs inside the corresponding directories with the same ACLs. BeeGFS uses the same method to store fast-readable file indices as SQLite databases inside the storage.
Unfortunately, lustre-db is not open sourced yet, but demand at the conference was high, and we were assured that a public release is planned.
Phobos
Phobos is a parallel, heterogeneous object store which is optimized for tape access while remaining flexible enough to integrate with a wide range of storage technologies. It exposes familiar frontend protocols such as S3 and Lustre HSM, through which users can feed data into a unified object-store layer. On the backend, multiple I/O adapters are available for writing the data out to tape, a POSIX filesystem, or a RADOS backend. Phobos supports versioning semantics even for tape-resident objects and provides a tagging mechanism that can be applied to subdirectories, tape volumes, or individual RADOS objects to improve classification and policy targeting. To apply data policies or migrate data, Robinhood is used.
Its architecture is dedicated to open formats, ensuring long-term interoperability across diverse storage environments and preventing any kind of vendor lock-in.
Misc
Some random thoughts I heard and discussions that we had:
People are starting to make backups on SSD instead of tape. Apparently there was a time when SSDs were cheap enough to buy and then enjoy the low power consumption while keeping the data online. Right now, however, this will probably change due to high retail prices.
Some people who have already worked in the storage area for three decades told us during a Q&A session that burst buffers were once a thing: workloads that write a lot in a short time can use some kind of write-back cache, and the filesystem takes care of putting all that data into the cluster afterwards. But, for good reason, they all disappeared from the public discussion over time. Everyone who tried to implement such a feature complained about the headache of deciding how to ensure data integrity and cache coherence. GPFS is one that still has such a feature, and Lustre also offers persistent client-side caching as well as a metadata writeback cache. Everyone agreed that the best way to handle something like this is for applications to be aware of the caching layers, with no generic "magic" happening under the hood. I guess the approach that BeeGFS took, writing to a local distributed filesystem and then having a proper command to write back to a remote target in a well-defined way, is a good compromise.
Closing thoughts
I enjoyed the conference very much, even though the venue was a bit shabby this year. But the small community still feels super friendly and approachable, and it is amazing how diverse the technical backgrounds and perspectives of the different people are. For example, one guy was working for EDF in France, where they use HPC to analyse 50 years of collected data from their nuclear power plants %) All in all, definitely worth it for me to go, especially since there are only purely technical talks.