Octotron: Active Control for Reliable Functioning of Supercomputers

http://github.com/srcc-msu/octotron (in English)

A state-of-the-art supercomputer is an extremely complex, expensive and energy-intensive system. Every one of its components is unreliable and can fail at any time. This may lead not only to application failures but also to equipment damage.
With this in mind, we maintain a project that aims to provide the highest possible safety of supercomputer hardware together with the highest possible utilization of computing resources. The general requirements we specified for the system, called Octotron, are:

  • Detection of all sources of damage or failure that may occur within a supercomputer or data center;
  • Automated reactions to the detected failures trying to eliminate unwanted
  • Flexibility and independence from any specific supercomputer architecture;
  • Scalability, with not only today's but also tomorrow's supercomputers in mind;
  • Strong self-diagnostics support, since the system itself may also malfunction;
  • Succession support, by creating a database of typical failures and reactions that can be shared with and enriched by the supercomputing community.

The key feature of the Octotron system is that it represents the model of supercomputer functioning as an expanded multigraph. The vertices of the graph represent supercomputer components (nodes, queues, software, etc.), while its edges represent relations between them (“contain”, “chill”, “power”, etc.). The vertices have attributes, which hold component properties received from the monitoring system (temperatures, counters, etc.), and rules, which are functions for failure detection. When a rule detects a failure, it calls a reaction, e.g. sending a message or running a script. With appropriate rules, the graph structure allows us to trace how a failure propagates from its source. We use Neo4j as the graph storage engine and Python to describe the model, rules and reactions.
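
The graph model described above can be illustrated with a minimal in-memory sketch. It is not Octotron's actual API and does not use Neo4j: the names Vertex, Model, update and affected, the component names, and the 30.0 degree threshold are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


@dataclass
class Vertex:
    """A supercomputer component with monitored attributes and rules."""
    name: str
    attributes: Dict[str, float] = field(default_factory=dict)
    # Each rule receives the attributes and returns True on a detected failure.
    rules: Dict[str, Callable[[Dict[str, float]], bool]] = field(default_factory=dict)


class Model:
    """Hypothetical model: vertices plus typed edges such as "chill"."""

    def __init__(self, reaction: Callable[[str, str, List[str]], None]):
        self.vertices: Dict[str, Vertex] = {}
        self.edges: List[Tuple[str, str, str]] = []  # (source, relation, target)
        self.reaction = reaction

    def add_vertex(self, vertex: Vertex) -> None:
        self.vertices[vertex.name] = vertex

    def add_edge(self, source: str, relation: str, target: str) -> None:
        self.edges.append((source, relation, target))

    def affected(self, name: str) -> List[str]:
        """Follow outgoing edges to list components downstream of a failure."""
        seen: List[str] = []
        stack = [name]
        while stack:
            current = stack.pop()
            for source, _relation, target in self.edges:
                if source == current and target != name and target not in seen:
                    seen.append(target)
                    stack.append(target)
        return seen

    def update(self, name: str, attribute: str, value: float) -> None:
        """Store a monitoring value, then run every rule of that vertex."""
        vertex = self.vertices[name]
        vertex.attributes[attribute] = value
        for rule_name, rule in vertex.rules.items():
            if rule(vertex.attributes):
                self.reaction(name, rule_name, self.affected(name))


# Usage: the reaction here only records a message instead of sending one.
messages = []
model = Model(reaction=lambda v, r, hit: messages.append((v, r, hit)))
model.add_vertex(Vertex(
    "chiller1",
    # Assumed threshold of 30.0 degrees, chosen only for the example.
    rules={"overheated": lambda a: a.get("temperature", 0.0) > 30.0},
))
model.add_vertex(Vertex("rack1"))
model.add_vertex(Vertex("node1"))
model.add_edge("chiller1", "chill", "rack1")
model.add_edge("rack1", "contain", "node1")

model.update("chiller1", "temperature", 25.0)  # below threshold, no reaction
model.update("chiller1", "temperature", 35.0)  # rule fires, reaction is called
```

In this sketch the reaction also receives the components reachable from the failed vertex, which is one simple way to follow a failure along the edges from its source.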

Octotron is available under the open-source MIT license. We currently use Octotron to control the MSU “Chebyshev” and “Lomonosov” supercomputers. Examples of errors detected by Octotron include temperatures exceeding thresholds; time drift, SSH/MPI unavailability and high load averages on computing nodes; suspicious states of the job queue; and growing error counts on network interfaces.
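
Some of the checks listed above could be written as small rule functions. The following is a hedged sketch only: the function names, parameters and limits are assumptions for illustration and are not taken from the Octotron code base.

```python
from typing import Sequence


def time_drifted(node_time: float, reference_time: float,
                 tolerance_seconds: float) -> bool:
    """True if a node clock differs from the reference by more than the tolerance."""
    return abs(node_time - reference_time) > tolerance_seconds


def load_too_high(load_average: float, cores: int, factor: float) -> bool:
    """True if the load average exceeds `factor` times the number of cores."""
    return load_average > factor * cores


def errors_growing(samples: Sequence[int], max_increase: int) -> bool:
    """True if an error counter grew by more than `max_increase`
    between the oldest and the newest sample in the window."""
    return len(samples) >= 2 and samples[-1] - samples[0] > max_increase


# Usage with made-up monitoring values.
drift = time_drifted(node_time=1005.0, reference_time=1000.0, tolerance_seconds=2.0)
overload = load_too_high(load_average=20.0, cores=8, factor=2.0)
growing = errors_growing(samples=[10, 10, 14, 25], max_increase=10)
```

The counter rule looks at a window of samples rather than a single value, because a nonzero error counter is not a problem in itself; only its growth is.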

