In October 2023 I attended LAD23 in Bordeaux: the Lustre Administration and Development workshop. I was working on a ~15 PB (raw capacity) storage system for a university that was based on the Lustre filesystem, and my colleague and I wanted to gather information since neither of us had any experience with Lustre yet. It was also my first in-person conference since FOSDEM 2020.
Conference and Organization
The conference, organized by EOFS, took place in the InterContinental Hotel in the city centre of Bordeaux, and the social event on the first evening was held at Château Pape Clément in the vineyards just outside the city. Both venues were a bit posh but served good wine and catering and featured nice decor.
Lustre News & Roadmap
A big part of the conference covered the current state of Lustre development. Traditionally, the first talk presents the Lustre roadmap and upcoming features. Most interesting for us is the erasure coding functionality, which is planned to be available in Lustre 2.17, expected at the end of 2024. This allows a user to get data redundancy without using a RAID configuration below the Lustre system, resulting in more flexibility.
Many of the following talks were Lustre developers presenting their work and users explaining how they found bugs and how they fixed them. Here is a list of the things I found most interesting, which is of course just an excerpt:
Lustre & Kerberos: A lot of work is happening here, mainly to allow the integration of Lustre in enterprise/cloud environments with different users and multi-tenancy support. With Kerberos-based authentication it is also possible to prevent unauthorized Lustre servers from joining the cluster. Another problem that can be addressed is the fact that Lustre trusts the root user on all clients by default: if an attacker becomes root on one client, they can read all data. (This can also be partially mitigated by using nodemaps.) The whole Kerberos-related code got a major rewrite and a lot of bugfixes coming in Lustre 2.16.
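The nodemap mitigation mentioned above can be sketched with `lctl` on the MGS. The tenant name and IP range below are made-up examples, and the exact semantics of each property are documented in the Lustre manual:

```shell
# Sketch only -- run on the MGS of a Lustre cluster; the nodemap name
# and address range are illustrative placeholders.
lctl nodemap_activate 1                      # enable nodemap enforcement
lctl nodemap_add tenant1                     # create a nodemap "tenant1"
lctl nodemap_add_range --name tenant1 --range 10.0.0.[2-254]@tcp
lctl nodemap_modify --name tenant1 --property admin   --value 0  # squash root
lctl nodemap_modify --name tenant1 --property trusted --value 0  # remap client UIDs/GIDs
```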
Other people are working on client-side data compression, which compresses data directly on the client before it is sent over the network. This solution does not depend on ZFS compression on the server, which is a common setup today when Lustre is used with ZFS. The main question about this was whether the OSD is able to decompress the stored data and then serve only the parts requested by the client, rather than the whole compressed block, which could otherwise lead to traffic amplification.
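The amplification concern can be illustrated with plain gzip as a stand-in for any stream compressor (this is an analogy of mine, not Lustre code): to serve a byte range from the middle of the original data, everything before it has to be decompressed and thrown away first.

```shell
# gzip stands in for the client-side compressor; an analogy, not Lustre.
head -c 1048576 /dev/urandom > chunk.raw   # 1 MiB "client write"
gzip -kf chunk.raw                         # compressed before "sending"
# To hand out bytes 512 KiB..516 KiB of the original, the whole prefix
# must be decompressed and discarded first -- the amplification in question:
zcat chunk.raw.gz | dd bs=4096 skip=128 count=1 iflag=fullblock of=range.out 2>/dev/null
```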
One of the most impressive announced features was Unaligned Direct I/O. From 2.16 on, Lustre tries to combine the advantages of buffered I/O and direct I/O: it automatically uses direct I/O for bigger files and the page cache for smaller files. The alignment for bigger files is done using an aligned buffer in kernel memory, which is way faster than using a cache and only a little slower than direct I/O. This may increase throughput drastically without the necessity to change the user application.
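As a toy illustration of what "aligning" means here (my own sketch, not Lustre code): an arbitrary (offset, length) request has to be rounded out to page boundaries before a direct I/O can be issued from the aligned kernel buffer.

```shell
# Toy sketch: widen a request to 4 KiB page boundaries, the way an
# unaligned request must be rounded out before direct I/O can be issued.
PAGE=4096
align_down() { echo $(( $1 / PAGE * PAGE )); }
align_up()   { echo $(( ($1 + PAGE - 1) / PAGE * PAGE )); }

off=5000 len=10000
aoff=$(align_down "$off")
aend=$(align_up "$(( off + len ))")
echo "request : offset=$off length=$len"
echo "aligned : offset=$aoff length=$(( aend - aoff ))"
```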
lljobstat is a new tool to debug slow I/O on MDTs/OSTs which has proven very useful according to the developers.
Better fscrypt support to move data without knowing the key, which had some pitfalls before. This will even be released within Lustre 2.15.
The Lustre filesystem is also being prepared for larger devices, with up to 1.5 PB per OST.
And support for hybrid setups (NVMe+HDD) in ldiskfs is being evaluated.
Robin Hood Filesystem Utilities
I had never heard of the robinhood software suite prior to this event. librobinhood is an efficient C API to store and query any filesystem's metadata. This is usually done by loading the metadata into a MongoDB database for searchability. The tools support POSIX, Lustre and MongoDB as backends and allow fast searching, filtering and changing of metadata.
What I really liked in their talk this year is that they implemented an expiry date for files, which currently works only on the Lustre backend, though general POSIX support is planned. The date is saved as a user-visible extended attribute and can later be found with the robinhood tools. Besides that, the tools also support complex filters and searching for other extended attributes.
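A rough sketch of the idea with standard tools: the attribute name `user.expires` below is my own placeholder (robinhood's real attribute name differs), and `setfattr` needs a filesystem with user-xattr support.

```shell
# Placeholder attribute name; robinhood's actual xattr differs.
touch scratch.dat
expiry=$(( $(date +%s) + 30*24*3600 ))           # 30 days from now
setfattr -n user.expires -v "$expiry" scratch.dat 2>/dev/null \
  || echo "no user-xattr support on this filesystem"

# The expiry check itself is just a timestamp comparison:
expired() { [ "$1" -le "$(date +%s)" ]; }        # exit 0 if already past
expired "$expiry" || echo "scratch.dat has not expired yet"
```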
Here are some further random notes and learnings I took during the conference.
LNet User Defined Selection Policies (UDSP) allow setting the priority of LNet links. This can result in equal or weighted load balancing between LNet devices, and it is also possible to use specific links only if no other is available (fallback). See `lnetctl set health` and `lnetctl set priority`.
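From memory, the policy rules look roughly like this with `lnetctl` (hedged sketch; the exact syntax varies between Lustre versions, so check `lnetctl udsp --help` and the manual):

```shell
# Illustrative only -- needs a Lustre node with LNet configured.
lnetctl udsp add --src o2ib --priority 0   # prefer the InfiniBand network
lnetctl udsp add --src tcp  --priority 1   # tcp becomes the fallback link
lnetctl udsp show                          # list the configured policies
```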
The Linux kernel may split e.g. a 16M I/O operation into smaller I/Os. One can use `perf` to check whether the submitted I/Os are the same as what is actually sent to disk.
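One way to observe this (my own example, needs root, and `testfile` is a throwaway path on the filesystem under test): trace the block layer's request tracepoint while issuing large writes, then compare the issued request sizes against the submitted 16M I/Os.

```shell
# Needs root; "testfile" is a disposable path on the filesystem under test.
perf record -e block:block_rq_issue -a -- \
    dd if=/dev/zero of=testfile bs=16M count=4 oflag=direct
perf script | head     # each line shows the size of a request actually issued
```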
One result from HPE's benchmark analysis talk: ZFS compression underneath Lustre is worth it and, in the best case, doubled throughput in benchmarks.
Personally, I realized two things during the conversations at this conference, my first HPC-related event. First, after getting used to HPC workloads, everything else seems too small in size. People here talk about an 800 TiB NVMe + 1.5 PiB HDD storage as the "playground" where their users can test things, and it's rather common that people have to move 10-50 PiB from one storage system to another.
Secondly, benchmarking is an important part of storage engineering. Besides fio, which I knew, there are also ior, io500 and mdtest. Benchmarking should be a standard task for a storage engineer, both to have a clear idea what to expect from the hardware beforehand and to spot misconfigurations during production use.
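For reference, a typical fio invocation for a sequential-throughput baseline might look like this (the parameters are illustrative; point `--directory` at the filesystem under test and tune block size and job count to the hardware):

```shell
# Illustrative baseline run -- adjust bs/size/numjobs to the hardware.
fio --name=seqwrite --directory=/mnt/lustre \
    --rw=write --bs=1M --size=4G --numjobs=4 \
    --ioengine=libaio --direct=1 --group_reporting
```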
As Sergey Kachkin said: “Storage is developed with benchmarks, tested with benchmarks, sold with benchmarks, only users actually have workloads”