
Scalable I/O for Extreme Performance (SIOX)

Applications running on HPC systems perform their I/O through a file system. This I/O mostly consists of the initial read of the input data, the periodic storage of checkpoints from which the execution state can be restored after an unexpected program termination, and the final writing of the application's actual output data.
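The three I/O phases named above can be illustrated with a minimal checkpoint/restart sketch. It is not taken from the project: the function names, the JSON checkpoint format, and the `fail_at` parameter used to simulate a crash are all assumptions made for illustration, and a real HPC application would typically use a parallel I/O library instead.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Write state atomically: write a temp file, then rename it over the target."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())  # make sure the data reached the storage device
    os.replace(tmp_path, path)  # a crash never leaves a half-written checkpoint


def load_checkpoint(path):
    """Return the saved state, or None when no checkpoint exists yet."""
    if not os.path.exists(path):
        return None
    with open(path) as f:
        return json.load(f)


def run(input_values, checkpoint_path, interval=2, fail_at=None):
    """Sum input_values, writing a checkpoint every `interval` steps.

    `fail_at` is a hypothetical hook that simulates an unexpected termination.
    """
    state = load_checkpoint(checkpoint_path) or {"next_index": 0, "total": 0}
    for i in range(state["next_index"], len(input_values)):
        if fail_at is not None and i == fail_at:
            raise RuntimeError("simulated crash")
        state["total"] += input_values[i]
        state["next_index"] = i + 1
        if state["next_index"] % interval == 0:
            save_checkpoint(checkpoint_path, state)
    return state["total"]
```

If a run is interrupted, calling `run` again with the same checkpoint path resumes from the last saved step instead of starting over. The checkpoint interval trades lost work after a failure against the I/O load that the checkpoints themselves create.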

To keep I/O operations from becoming the scalability bottleneck of HPC applications, the file system and I/O infrastructure must keep pace with the increasing performance and number of computing cores in HPC systems. In this context, a global optimization of the file system turns out to be very difficult or impossible, in part because the requirements and expectations of different user groups are so disparate, and in part because there is currently no way to identify abnormal I/O behavior and trace it back to its source.

SIOX's main goal is to gain an overview of all the I/O activity taking place on an HPC system, and to use this information to optimize that activity. Initially, the project's scope covers the development of standardized interfaces to collect, reduce, and store performance data from all relevant layers. This information will then be analyzed and correlated with previously observed access patterns in order to understand the characteristics and causal relationships of the system.
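The collect and reduce steps can be pictured with a small sketch. It does not show the SIOX interfaces, which are not described here: `IORecorder`, `TracedFile`, and the layer name "application" are hypothetical names chosen for illustration, and the sketch instruments only a single in-memory file object.

```python
import io
import time
from collections import defaultdict


class IORecorder:
    """Hypothetical collector: gathers raw I/O events and reduces them."""

    def __init__(self):
        self.events = []  # one (layer, operation, bytes, seconds) tuple per call

    def record(self, layer, operation, nbytes, seconds):
        self.events.append((layer, operation, nbytes, seconds))

    def reduce(self):
        """Aggregate events per (layer, operation): call count, bytes, time."""
        summary = defaultdict(lambda: {"calls": 0, "bytes": 0, "seconds": 0.0})
        for layer, operation, nbytes, seconds in self.events:
            entry = summary[(layer, operation)]
            entry["calls"] += 1
            entry["bytes"] += nbytes
            entry["seconds"] += seconds
        return dict(summary)


class TracedFile:
    """Wraps a binary file-like object and reports reads and writes to a recorder."""

    def __init__(self, fileobj, recorder, layer="application"):
        self._file = fileobj
        self._recorder = recorder
        self._layer = layer

    def write(self, data):
        start = time.perf_counter()
        written = self._file.write(data)
        self._recorder.record(self._layer, "write", written,
                              time.perf_counter() - start)
        return written

    def read(self, size=-1):
        start = time.perf_counter()
        data = self._file.read(size)
        self._recorder.record(self._layer, "read", len(data),
                              time.perf_counter() - start)
        return data


# Usage: two writes and one read against an in-memory buffer.
recorder = IORecorder()
buffer = io.BytesIO()
traced = TracedFile(buffer, recorder)
traced.write(b"abcd")
traced.write(b"efgh1234")
buffer.seek(0)
traced.read(6)
summary = recorder.reduce()
```

Here `summary` holds one aggregate per layer and operation, which is much smaller than the raw event list. Storing such reduced records over time is what would make a comparison with previously observed access patterns possible.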

This knowledge will be the starting point for subsequent performance optimizations aimed at specific users and applications, carried out through, for example, the automatic tuning of Open MPI or file system parameters. Such usage profiles will be created continuously and will be helpful not only for optimization but also when diagnosing acute performance problems or planning new acquisitions. In the course of the project, a holistic approach to I/O analysis is to be conceived, implemented, and applied. While SIOX is oriented towards HPC environments, its applicability should not be restricted to them. The integrated analysis of applications, file systems, and infrastructure could thus also be used to optimize other scenarios in the future, for example the design of file system caches for mail servers.

Dr.-Ing. Michael Kluge

Administrative contact

Zentrum für Informationsdienste und Hochleistungsrechnen
Zellescher Weg 12
01069 Dresden