Presentation
Partial Redundancy in HPC Systems with Non-Uniform Node Reliabilities
SessionResilience 2
Event Type
Paper
TP
Performance
Resiliency
Tools
TimeWednesday, November 14th2pm - 2:30pm
LocationC146
DescriptionWe study the usefulness of partial redundancy in HPC message passing systems where individual node failure distributions are not identical. Prior research works on fault tolerance have generally assumed identical failure distributions for the nodes of the system. In such settings, partial replication has never been shown to outperform the two extremes (full and no-replication) for any significant range of node counts. We argue that partial redundancy may provide the best performance under the more realistic assumption of non-identical node failure distributions. We provide theoretical results on arranging nodes with different reliability values among replicas such that system reliability is maximized. Moreover, using system reliability to compute MTTI (mean-time-to-interrupt) and expected completion time of a partially replicated system, we numerically determine the optimal partial replication degree. Our results indicate that partial replication can be a more efficient alternative to full replication at system scales where Checkpoint/Restart alone is not sufficient.
Download PDF
Archive




