Power and temperature control for large-scale computing infrastructures

Researcher: Martina Maggio
Funding: VR
Duration: 2014-2017

Modern computing systems are constrained by dark silicon: the abundance of transistors enables processors to draw more power than they can safely sustain. For example, the Exynos 5 processor (in the Samsung Galaxy S4 phone) has a 5.5 W peak power, nearly twice the maximum sustainable heat dissipation, which limits operation at peak speed to less than 1 second. At the other end of the spectrum, the next generation of exascale supercomputers is predicted to be constrained by an operating budget of approximately 20 MW. In addition, Microsoft was recently fined for not using enough power, thereby violating an agreement with a utility company. Executing code efficiently on these systems requires solving a constrained optimization problem: staying within the power budget while maximizing performance under that constraint.
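As a minimal illustration of this constrained optimization problem, the Python sketch below picks, from a small table of configurations, the one with the highest performance whose power stays within the budget. The configuration names, power figures and speedups are hypothetical values made up for the example, not measurements from this project, and exhaustive search over a fixed table is only the simplest possible formulation.

```python
from typing import NamedTuple, Optional, Sequence


class Config(NamedTuple):
    name: str
    power_w: float  # assumed average power draw in watts (hypothetical)
    speedup: float  # assumed performance relative to a baseline (hypothetical)


def best_under_budget(configs: Sequence[Config], budget_w: float) -> Optional[Config]:
    """Return the fastest configuration whose power does not exceed the budget."""
    feasible = [c for c in configs if c.power_w <= budget_w]
    if not feasible:
        return None  # no configuration respects the budget
    return max(feasible, key=lambda c: c.speedup)


# Hypothetical configuration table, for illustration only.
CONFIGS = [
    Config("1 core @ 1.0 GHz", 1.0, 1.0),
    Config("2 cores @ 1.0 GHz", 1.8, 1.9),
    Config("4 cores @ 1.0 GHz", 3.2, 3.4),
    Config("4 cores @ 1.6 GHz", 5.5, 5.0),
]

if __name__ == "__main__":
    print(best_under_budget(CONFIGS, budget_w=3.5))
```

In practice such a table is rarely known in advance, which is why the power and performance of each configuration have to be estimated or corrected at runtime.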

Many separate components contribute to total power consumption, and various techniques have been proposed to manage individual components. For example, management systems exist for CPU allocation, dynamic voltage and frequency scaling, processor idling, cache, DRAM, and disk. However, coordinating these many actuators is non-trivial and requires knowledge of all the potential nonlinearities that the hardware infrastructure may expose. The goal of this research is to develop a platform-independent resource manager to control the temperature and power consumption of large computing infrastructures such as data centers. This management system should be general with respect to the running platform and must address three challenges:

  • Unknowns: Prior research approaches rely on rigorous models for either the specific machine under control or for a specific application and platform. A generalized power management system, however, must either construct its models on the fly or compensate for inaccuracies and unknowns in the model.
  • Interaction: System components interact to produce a complex (often nonlinear) effect on power, temperature and performance. If individual components are controlled separately, their interaction can lead to suboptimal behavior, even when these separate controllers are individually optimal. Thus, a generalized power management system must coordinate all available components even if they are not known at design time or vary at runtime.
  • Optimization: A power manager must not exceed the power budget, yet must also deliver the best possible performance for a given budget. A generalized approach must not sacrifice too much performance for generality.

This research addresses the above challenges. The result so far is a machine-level power management system that is general with respect to the components it manages and uses feedback control to ensure that the power and temperature budgets are respected, while delivering the best possible performance to the running applications. The project originated from a publication at PACT 2013 (Parallel Architectures and Compilation Techniques) titled ThermOS: System Support for Dynamic Thermal Management of Chip Multi-Processors. In 2014 it led to the publication of the article PCP: A Generalized Approach to Optimizing Performance Under Power Constraints through Resource Management at ICAC 2014 (International Conference on Autonomic Computing). Follow-up work was presented at RTAS 2015 (21st IEEE Real-Time and Embedded Technology and Applications Symposium) with the paper POET: a portable approach to minimizing energy under soft real-time constraints, and at FSE 2015 (Foundations of Software Engineering) with the paper Automated multi-objective control for self-adaptive software design.
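To give a flavour of how feedback control can keep power within a budget despite an inaccurate model, the Python sketch below implements a simple integral controller that adjusts a single abstract "knob" (for example, a normalized speed setting). It is an illustrative assumption, not the controller used in the publications above: the function names, the knob, the pole parameter and the linear simulated machine (2 W idle plus 10 W per unit of knob) are all hypothetical.

```python
def make_power_capper(budget_w, gain_estimate_w, knob_min=0.0, knob_max=1.0, pole=0.5):
    """Return a step function mapping measured power (W) to the next knob setting.

    gain_estimate_w is the controller's (possibly wrong) guess of how many
    watts one unit of knob adds; pole in [0, 1) trades speed for robustness.
    """
    knob = knob_min

    def step(measured_w):
        nonlocal knob
        error = budget_w - measured_w  # positive when there is power headroom
        knob += (1.0 - pole) * error / gain_estimate_w  # integral action
        knob = min(knob_max, max(knob_min, knob))  # respect actuator limits
        return knob

    return step


def simulate(steps=30):
    """Run the controller against a hypothetical linear machine model."""
    # The controller assumes 8 W per unit of knob; the simulated machine
    # actually adds 10 W per unit, so the model is deliberately inaccurate.
    step = make_power_capper(budget_w=7.0, gain_estimate_w=8.0)
    knob, power = 0.0, 0.0
    for _ in range(steps):
        power = 2.0 + 10.0 * knob  # hypothetical plant: idle power + linear term
        knob = step(power)
    return knob, power


if __name__ == "__main__":
    knob, power = simulate()
    print(f"knob={knob:.3f} power={power:.3f} W")
```

Because the controller acts on the measured error rather than trusting its model, the simulated power settles at the 7 W budget even though the assumed gain is wrong; a real machine would add noise, delays and nonlinear interactions between several actuators.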

Publications

Stepan Shevtsov, Danny Weyns, Martina Maggio: "Handling New and Changing Requirements with Guarantees in Self-Adaptive Systems using SimCA". In: The 12th International Symposium on Software Engineering for Adaptive and Self-Managing Systems 2017.

Martina Maggio, Alessandro Vittorio Papadopoulos, Antonio Filieri, Henry Hoffmann: "Self-Adaptive Video Encoder: Comparison of Multiple Adaptation Strategies Made Simple". In: 12th IEEE/ACM International Symposium on Software Engineering for Adaptive and Self-Managing Systems, SEAMS 2017.

Martina Maggio: "Real-Time Implementation of Control Systems". In: Handbook of Cyber-Physical Systems, Springer, 2017.

Alex Iosup, Xiaoyun Zhu, Arif Merchant, Eva Kalyvianaki, Martina Maggio, Simon Spinner, Tarek Abdelzaher, Ole J. Mengshoel, Sara Bouchenak: "Self-awareness of Cloud Applications". In S. Kounev, J.O. Kephart, A. Milenkoski, X. Zhu (Eds.): Self-Aware Computing Systems, Springer Verlag, 2017.

Simon Spinner, Antonio Filieri, Samuel Kounev, Martina Maggio, Anders Robertsson: "Run-Time Models for Online Performance and Resource Management in Data Centers". In S. Kounev, J.O. Kephart, A. Milenkoski, X. Zhu (Eds.): Self-Aware Computing Systems, Springer Verlag, 2017.

Nikolas Herbst, Steffen Becker, Samuel Kounev, Heiko Koziolek, Martina Maggio, Evgenia Smirni: "Metrics and Benchmarks for Self-aware Computing Systems". In S. Kounev, J.O. Kephart, A. Milenkoski, X. Zhu (Eds.): Self-Aware Computing Systems, Springer Verlag, 2017.

Martina Maggio, Tarek Abdelzaher, Lukas Esterle, Holger Giese, Jeffrey O. Kephart, Ole J. Mengshoel, Alessandro Vittorio Papadopoulos, Anders Robertsson, Katinka Wolter: "Self-adaptation for Individual Self-aware Computing Systems". In S. Kounev, J.O. Kephart, A. Milenkoski, X. Zhu (Eds.): Self-Aware Computing Systems, Springer Verlag, 2017.

Jeffrey O. Kephart, Martina Maggio, Ada Diaconescu, Holger Giese, Henry Hoffmann, Samuel Kounev, Anne Koziolek, Peter Lewis, Anders Robertsson, Simon Spinner: "Reference Scenarios for Self-aware Computing". In S. Kounev, J.O. Kephart, A. Milenkoski, X. Zhu (Eds.): Self-Aware Computing Systems, Springer Verlag, 2017.