Distributed Modular Monitoring System: a New Approach to Supercomputer Monitoring

State-of-the-art monitoring software was designed for monitoring network services and hardware. Supercomputer monitoring has several goals that differ in the required reliability, the complexity of the algorithms used, and the volume of data to be processed.

Existing monitoring software is well suited to monitoring static configurations: the infrastructure and the basic health of hardware components can be monitored well with such software.

Monitoring dynamic entities, e.g. performance monitoring of running jobs, is a more difficult task. The performance data stream is large. Storing it in a database and fetching it afterwards for processing puts a high load on storage and requires expensive hardware capable of sustaining a high I/O rate.

We propose a new approach to building a distributed modular monitoring system, named DiMMon, for both current and future systems with an extremely high degree of parallelism. The system is intended for both state monitoring and performance monitoring.

The DiMMon monitoring system framework design is based on the following principles:

  • Ability to direct different data flows along different routes, or copy the same data to several recipients for different processing functions.
  • Support for the dynamic reconfiguration of the monitoring system operation modes (data transmission routes, data collection parameters, data processing rules).
  • Ability to calculate performance metrics for individual jobs while the data are being collected, without writing them to disk and subsequently reading them back.
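
The first two principles can be illustrated with a minimal sketch. It is not DiMMon code: the `Router` class, the sample layout, and the handler names are all hypothetical and chosen only to show per-kind routing, copying one sample to several recipients, and changing routes while the system is running.

```python
from collections import defaultdict
from typing import Callable, Dict, List

# Hypothetical sample layout, e.g. {"kind": "cpu_load", "node": "n001", "value": 0.93}
Sample = dict
Handler = Callable[[Sample], None]


class Router:
    """Hypothetical router: maps a data kind to a list of recipients."""

    def __init__(self) -> None:
        self._routes: Dict[str, List[Handler]] = defaultdict(list)

    def add_route(self, kind: str, handler: Handler) -> None:
        # Several handlers per kind means the same data is copied to each.
        self._routes[kind].append(handler)

    def remove_route(self, kind: str, handler: Handler) -> None:
        # Routes change at run time; nothing is restarted.
        self._routes[kind].remove(handler)

    def dispatch(self, sample: Sample) -> int:
        """Deliver a sample to every recipient of its kind; return the count."""
        handlers = list(self._routes.get(sample["kind"], []))
        for handler in handlers:
            handler(sample)
        return len(handlers)


state_log: List[Sample] = []
perf_log: List[Sample] = []


def to_state_monitoring(sample: Sample) -> None:
    state_log.append(sample)


def to_performance_monitoring(sample: Sample) -> None:
    perf_log.append(sample)


router = Router()
router.add_route("node_alive", to_state_monitoring)
router.add_route("cpu_load", to_state_monitoring)        # same data ...
router.add_route("cpu_load", to_performance_monitoring)  # ... two recipients

router.dispatch({"kind": "cpu_load", "node": "n001", "value": 0.93})
router.dispatch({"kind": "node_alive", "node": "n001", "value": 1})

# Dynamic reconfiguration: stop copying CPU load to state monitoring.
router.remove_route("cpu_load", to_state_monitoring)
```

After the last call, further `cpu_load` samples reach only the performance-monitoring recipient, while `node_alive` samples continue along their original route.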

Sending different data along different routes allows data for both state monitoring and performance monitoring to be collected in one place and then sent to the respective parts of the monitoring system for further processing. The data needed to calculate performance metrics for an individual job are sent to components created for that specific job. Data routes are reconfigured without restarting the system. As a result, metrics can be calculated while the system is running, without storing the data in a database, because the data for a given job can be directed to a component dedicated to that job.
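
As a rough illustration of per-job processing without intermediate storage, the sketch below keeps only a running mean and a peak value for each job. The class names, the node-to-job mapping, and the choice of metrics are assumptions made for this example and do not describe the actual DiMMon components.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Optional


@dataclass
class JobAggregator:
    """Hypothetical per-job component: updates metrics online, stores no samples."""
    job_id: str
    count: int = 0
    mean: float = 0.0
    peak: float = float("-inf")

    def consume(self, value: float) -> None:
        # Incremental mean: no history of samples is kept.
        self.count += 1
        self.mean += (value - self.mean) / self.count
        self.peak = max(self.peak, value)


class JobRouter:
    """Hypothetical router: sends each node's samples to the job running there."""

    def __init__(self) -> None:
        self._node_to_job: Dict[str, JobAggregator] = {}

    def job_started(self, job_id: str, nodes: Iterable[str]) -> JobAggregator:
        # A component is created for this specific job and routes are added.
        aggregator = JobAggregator(job_id)
        for node in nodes:
            self._node_to_job[node] = aggregator
        return aggregator

    def job_finished(self, aggregator: JobAggregator) -> None:
        # Routes for the job are removed without restarting anything.
        self._node_to_job = {
            node: agg for node, agg in self._node_to_job.items()
            if agg is not aggregator
        }

    def on_sample(self, node: str, value: float) -> bool:
        """Route one sample; return False if no job owns the node."""
        aggregator: Optional[JobAggregator] = self._node_to_job.get(node)
        if aggregator is None:
            return False
        aggregator.consume(value)
        return True


router = JobRouter()
job = router.job_started("job-42", ["n001", "n002"])
router.on_sample("n001", 0.5)
router.on_sample("n002", 1.0)
router.on_sample("n003", 0.9)   # no job on this node: ignored
router.job_finished(job)
router.on_sample("n001", 0.1)   # route already removed: ignored
print(job.count, job.mean, job.peak)
```

Only the two samples that arrived from the job's nodes while it was running contribute to its metrics; everything else is dropped or would be routed elsewhere.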