HOPSA - the joint RF/EU project: HOlistic Performance System Analysis

Joint project of Russian and European organizations

To maximise the scientific and commercial output of a high performance computing system, different stakeholders pursue different strategies. While individual application developers try to shorten the time to solution by optimising their codes, system administrators tune the configuration of the overall system to increase its throughput. Yet the complexity of today's machines, with their strong interrelationship between application and system performance, makes these goals hard to achieve. Our experience indicates that even aspects like job scheduling or quota management can often lead to performance degradation.

All of these factors point to the need for a dedicated tool that shows where and, more importantly, why performance is lost. To provide high-quality and versatile estimates, such a tool must take into account a wide range of data on job behavior and supercomputer status. A tool of this type is being developed in the Laboratory of Parallel Informational Technologies of the Research Computing Center of Lomonosov Moscow State University. It is named LAPTA, a recursive acronym for “Lapta is a pAckage for Performance moniToring and Analysis”. LAPTA is designed for versatile analysis of the dynamic characteristics of parallel programs running on supercomputers. The analysis covers a job's characteristics from submission to completion. This yields full information both on a single job and on all jobs running simultaneously on the system.

Data from various hardware sensors on the cluster nodes is collected by agents and passed to the aggregation level. Agents collect data from the Resource Manager as well. Aggregation modules save the collected data into databases using DB modules. Later, analysis modules can access the data for further analysis. One example is the Job Digest, an approach to analysing the dynamic characteristics of applications that was developed at MSU under the project. Such a digest is a detailed report on each job run, based on system monitoring data.
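The data path described above (agents, aggregation, database, analysis) can be illustrated with a minimal Python sketch. It is not LAPTA code: the `Sample` record, the table layout, and the functions `open_store`, `aggregate`, `register_job` and `job_digest` are hypothetical names chosen for illustration, and an in-memory SQLite database stands in for whatever storage the real DB modules use.

```python
import sqlite3
from dataclasses import dataclass


@dataclass
class Sample:
    """One sensor reading as an agent might report it (hypothetical format)."""
    node: str
    sensor: str
    timestamp: float
    value: float


def open_store():
    """Create an in-memory database standing in for the DB modules."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE samples (node TEXT, sensor TEXT, ts REAL, val REAL)")
    db.execute("CREATE TABLE jobs (job_id TEXT, node TEXT, t_start REAL, t_end REAL)")
    return db


def aggregate(db, samples):
    """Aggregation step: store the readings received from the agents."""
    db.executemany(
        "INSERT INTO samples VALUES (?, ?, ?, ?)",
        [(s.node, s.sensor, s.timestamp, s.value) for s in samples],
    )
    db.commit()


def register_job(db, job_id, nodes, t_start, t_end):
    """Record which nodes a job used and when (data a resource manager would supply)."""
    db.executemany(
        "INSERT INTO jobs VALUES (?, ?, ?, ?)",
        [(job_id, node, t_start, t_end) for node in nodes],
    )
    db.commit()


def job_digest(db, job_id):
    """Analysis step: per-sensor min/avg/max over the job's nodes and run time."""
    rows = db.execute(
        """
        SELECT s.sensor, MIN(s.val), AVG(s.val), MAX(s.val)
        FROM samples AS s
        JOIN jobs AS j
          ON j.node = s.node AND s.ts BETWEEN j.t_start AND j.t_end
        WHERE j.job_id = ?
        GROUP BY s.sensor
        ORDER BY s.sensor
        """,
        (job_id,),
    ).fetchall()
    return {sensor: {"min": lo, "avg": avg, "max": hi} for sensor, lo, avg, hi in rows}
```

In this sketch a digest is only a summary table per sensor. Joining monitoring samples with the job's node list and time window is what ties system-level data to an individual job run.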

The LAPTA tool is being developed within the joint Russian-European HOPSA project, which stands for “HOlistic Performance System Analysis”.

The HOPSA project (HOlistic Performance System Analysis) therefore sets out, for the first time, to develop an integrated diagnostic infrastructure for combined application and system tuning. With more powerful diagnostic tools, application developers and system administrators can more easily identify the root causes of their respective bottlenecks. The HOPSA infrastructure makes it more effective to optimize codes running on HPC systems: more efficient codes mean either getting results faster or obtaining higher-quality or more results in the same time.

Overall job flow

The work in HOPSA was carried out by two coordinated projects funded by the EU under call FP7-ICT-2011-EU-Russia and the Russian Ministry of Education and Science. While the Russian consortium focused on the system aspect, the EU consortium focused on the application aspect. At the interface between these two facets of our holistic approach is the system-wide performance screening of individual jobs, pointing at both inefficiencies of individual applications and system-related performance issues. For HPC application tuning, developers can choose from a variety of mature performance-analysis tools developed by our consortium.

The HOPSA project delivers an innovative, holistic tool suite for the optimization of HPC applications, integrated with system-level monitoring. The HPC support teams of the project partners use the tools in their daily work, resulting in application performance improvements.

By taking an integrated approach for the first time worldwide, the seven universities and research institutions involved considerably strengthen their scientific position as competence centres in HPC. The project also results in closer collaboration between HPC researchers from the EU and Russia.

Russian partners:

  • Research Computing Center, Moscow State University (Russian coordinator);
  • T-Platforms;
  • Joint Supercomputer Center of the Russian Academy of Sciences;
  • Scientific Research Institute of Multiprocessor Computer Systems, Southern Federal University.

EU partners:

  • Forschungszentrum Jülich GmbH (EU coordinator);
  • Rogue Wave Software AB;
  • Barcelona Supercomputing Center;
  • German Research School for Simulation Sciences;
  • Technical University Dresden.
