Presentation
CPU Overheating Characterization in HPC Systems: a Case Study
Author/Presenters
Event Type
Workshop
Resiliency
Scientific Computing
Time
Friday, November 16th, 11:10am - 11:30am
Location
D174
Description
As supercomputers grow in size, the number of abnormal events they experience also increases. Some of these events might lead to an application failure. Others might simply reduce system efficiency. CPU overheating is one such event: when a CPU overheats, it reduces its frequency. This paper studies the problem of CPU overheating in supercomputers. In the first part, we analyze data collected over one year on a supercomputer from the Top500 list to understand the conditions under which CPU overheating occurs. Our analysis shows that overheating events are due to specific applications. In the second part, we evaluate the impact of such overheating events on the performance of MPI applications. Using six representative HPC benchmarks, we show that, for a majority of the applications, a frequency drop on a single CPU increases the execution time of distributed runs in proportion to the duration and the extent of the frequency drop.
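The proportionality stated in the last sentence can be illustrated with a toy model. The sketch below is not the paper's methodology: it assumes a compute-bound, tightly synchronized MPI run in which every rank waits for the slowest one, so progress during the throttled interval scales with the ratio of throttled to nominal frequency. The function name and all parameters are hypothetical.

```python
def estimated_runtime(nominal_s, throttle_s, nominal_ghz, throttled_ghz):
    """Toy estimate of the wall-clock time of a synchronized run when one
    CPU is throttled for `throttle_s` seconds (assumption: compute-bound
    work, so progress rate is proportional to clock frequency)."""
    if not 0 < throttled_ghz <= nominal_ghz:
        raise ValueError("expected 0 < throttled_ghz <= nominal_ghz")
    if nominal_s < 0 or throttle_s < 0:
        raise ValueError("durations must be non-negative")

    # Extent of the drop: fraction of progress lost per throttled second.
    extent = 1.0 - throttled_ghz / nominal_ghz

    # The throttled interval cannot outlast a run that is throttled
    # from start to finish.
    max_throttle_s = nominal_s * nominal_ghz / throttled_ghz
    throttle_s = min(throttle_s, max_throttle_s)

    # Extra time is linear in both the duration and the extent of the drop.
    return nominal_s + throttle_s * extent


# A 100 s run with one CPU at half frequency for 20 s.
print(estimated_runtime(100.0, 20.0, 2.0, 1.0))
```

Under these assumptions the overhead is the throttled duration multiplied by the relative frequency loss. Communication-bound or loosely coupled applications would not be expected to follow this model.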

