BEGIN:VCALENDAR
VERSION:2.0
PRODID:Linklings LLC
BEGIN:VTIMEZONE
TZID:America/Chicago
X-LIC-LOCATION:America/Chicago
BEGIN:DAYLIGHT
TZOFFSETFROM:-0600
TZOFFSETTO:-0500
TZNAME:CDT
DTSTART:19700308T020000
RRULE:FREQ=YEARLY;BYMONTH=3;BYDAY=2SU
END:DAYLIGHT
BEGIN:STANDARD
TZOFFSETFROM:-0500
TZOFFSETTO:-0600
TZNAME:CST
DTSTART:19701101T020000
RRULE:FREQ=YEARLY;BYMONTH=11;BYDAY=1SU
END:STANDARD
END:VTIMEZONE
BEGIN:VEVENT
DTSTAMP:20181221T160731Z
LOCATION:C146
DTSTART;TZID=America/Chicago:20181114T140000
DTEND;TZID=America/Chicago:20181114T143000
UID:submissions.supercomputing.org_SC18_sess216_pap381@linklings.com
SUMMARY:Partial Redundancy in HPC Systems with Non-Uniform Node Reliabilit
 ies
DESCRIPTION:Paper\nPerformance, Resiliency, Tools, Tech Program Reg Pass\n
 \nPartial Redundancy in HPC Systems with Non-Uniform Node Reliabilities\n\
 nHussain, Znati, Melhem\n\nWe study the usefulness of partial redundancy i
 n HPC message passing systems where individual node failure distributions 
 are not identical. Prior research on fault tolerance has generally assumed
  identical failure distributions for the nodes of the system. In su
 ch settings, partial replication has never been shown to outperform the tw
 o extremes (full and no-replication) for any significant range of node cou
 nts. We argue that partial redundancy may provide the best performance und
 er the more realistic assumption of non-identical node failure distributio
 ns. We provide theoretical results on arranging nodes with different relia
 bility values among replicas such that system reliability is maximized. Mo
 reover, using system reliability to compute MTTI (mean-time-to-interrupt) 
 and expected completion time of a partially replicated system, we numerica
 lly determine the optimal partial replication degree. Our results indicate
  that partial replication can be a more efficient alternative to full repl
 ication at system scales where Checkpoint/Restart alone is not sufficient.
URL:https://sc18.supercomputing.org/presentation/?id=pap381&sess=sess216
END:VEVENT
END:VCALENDAR

