Learning for large-scale parallel platform control.

Authors

REIS Valentin
TRYSTRAM Denis
LELONG Jerome
LEGRAND Arnaud
KAUFMANN Emilie
NGUYEN Kim Thang
GOLDMAN Alfredo
TAUFER Michela

Publication date

2018

Publication type

Thesis

Summary

Providing the computing infrastructures needed to solve the complex problems of modern society is a strategic challenge. Organizations traditionally respond to this challenge by setting up large parallel and distributed computing infrastructures. Vendors of High Performance Computing systems are driven by competition to produce ever more computing and storage power, leading to specialized and sophisticated "Petascale" platforms, and soon to "Exascale" machines. These systems are centrally managed using job management software and dedicated resources. A particular problem that this software addresses is scheduling: the resource manager must choose when, and on which resources, to execute each computational task. This thesis provides solutions to this problem. All platforms are different: their infrastructure, the behavior of their users, and the objectives of the host organization vary. We therefore argue that scheduling policies must adapt to the behavior of the systems. We present several ways to achieve this adaptability. Through an experimental approach, we study several tradeoffs between the complexity of the approach, the potential gain, and the risks taken.
