Performance System Analysis

To maximize the scientific and commercial output of a high performance computing system, different stakeholders pursue different strategies. While individual application developers try to shorten the time to solution by optimizing their codes, system administrators tune the configuration of the overall system to increase its throughput. Yet the complexity of today's machines, with their strong interrelationship between application and system performance, makes these goals hard to achieve. Our experience indicates that even aspects such as job scheduling or quota management can often lead to performance degradation.



All the reasons mentioned above indicate the demand for specific tools that show where and, more importantly, why performance is lost. To provide high-quality and versatile estimates, such tools should consider a variety of data on job behavior and supercomputer status. A toolkit of this type is being developed in the Laboratory of Parallel Informational Technologies of the Research Computing Center of Lomonosov Moscow State University.



Job Digest

The system is designed for versatile analysis of the dynamic characteristics of parallel programs running on a supercomputer. The analysis takes into account a job's characteristics from its submission to its finish. This provides full information both on a single job and on all jobs running simultaneously on the system. Agents collect data from various hardware sensors on the cluster nodes and pass it to the aggregation level. Agents collect data from the Resource Manager as well. Aggregation modules save the collected data into databases using DB modules. Later, analysis modules can access the data for further analysis. One example is Job Digest, one of the approaches to analyzing the dynamic characteristics of applications, developed at MSU under the project. A digest is a detailed report on every job run, based on system monitoring data.
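The data flow described above (agents, aggregation, database, analysis) can be illustrated with a minimal Python sketch. It is not the toolkit's actual implementation: the sample values, the `samples` table layout, and the function names `store_samples` and `job_digest` are all hypothetical, and an in-memory SQLite database stands in for whatever storage the real DB modules use.

```python
import sqlite3

# Hypothetical sensor samples an agent might report: (job_id, node, metric, value).
SAMPLES = [
    (101, "node1", "cpu_load", 0.90),
    (101, "node1", "cpu_load", 0.70),
    (101, "node2", "cpu_load", 0.40),
    (102, "node3", "cpu_load", 0.20),
]


def store_samples(conn, samples):
    """Aggregation step: persist the samples collected by the agents."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS samples "
        "(job_id INTEGER, node TEXT, metric TEXT, value REAL)"
    )
    conn.executemany("INSERT INTO samples VALUES (?, ?, ?, ?)", samples)
    conn.commit()


def job_digest(conn, job_id):
    """Analysis step: summarize each metric of one job as min/avg/max."""
    rows = conn.execute(
        "SELECT metric, MIN(value), AVG(value), MAX(value), COUNT(DISTINCT node) "
        "FROM samples WHERE job_id = ? GROUP BY metric",
        (job_id,),
    ).fetchall()
    return {
        metric: {"min": lo, "avg": avg, "max": hi, "nodes": nodes}
        for metric, lo, avg, hi, nodes in rows
    }


conn = sqlite3.connect(":memory:")  # assumption: SQLite as a stand-in database
store_samples(conn, SAMPLES)
digest = job_digest(conn, 101)
print(digest)
```

In this sketch a digest is simply a per-metric summary for one job; a real report would presumably cover many more metrics and the whole period from job submission to finish.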

Comprehensive system performance analysis is now one of the major research directions aimed at ensuring efficient utilization of HPC resources at MSU and other supercomputer centers.
