Learning for large-scale parallel platform control.

Authors
Publication date
2018
Publication type
Thesis
Summary
Providing the computing infrastructure needed to solve the complex problems of modern society is a strategic challenge. Organizations traditionally respond by setting up large parallel and distributed computing infrastructures. Competition drives vendors of High Performance Computing systems to produce ever more computing and storage power, leading to specialized and sophisticated "Petascale" platforms, and soon to "Exascale" machines. These systems are centrally managed with job management software and dedicated resources. A particular problem that this software addresses is scheduling: the resource manager must choose when, and on which resources, to execute each computational task. This thesis provides solutions to this problem. Every platform is different: the infrastructure, the behavior of the users, and the objectives of the host organization all vary. We therefore argue that scheduling policies must adapt to the behavior of the systems they manage. In this thesis, we present several ways to achieve this adaptability. Through an experimental approach, we study several tradeoffs between the complexity of the approach, the potential gain, and the risks taken.