BEGIN:VCALENDAR
VERSION:2.0
PRODID:Linklings LLC
BEGIN:VTIMEZONE
TZID:America/Chicago
X-LIC-LOCATION:America/Chicago
BEGIN:DAYLIGHT
TZOFFSETFROM:-0600
TZOFFSETTO:-0500
TZNAME:CDT
DTSTART:19700308T020000
RRULE:FREQ=YEARLY;BYMONTH=3;BYDAY=2SU
END:DAYLIGHT
BEGIN:STANDARD
TZOFFSETFROM:-0500
TZOFFSETTO:-0600
TZNAME:CST
DTSTART:19701101T020000
RRULE:FREQ=YEARLY;BYMONTH=11;BYDAY=1SU
END:STANDARD
END:VTIMEZONE
BEGIN:VEVENT
DTSTAMP:20181221T160731Z
LOCATION:D171/173
DTSTART;TZID=America/Chicago:20181116T104500
DTEND;TZID=America/Chicago:20181116T110000
UID:submissions.supercomputing.org_SC18_sess145_ws_p3hpc110@linklings.com
SUMMARY:Performance Portability Challenges for Fortran Applications
DESCRIPTION:Workshop\nHeterogeneous Systems, Performance, Workshop Reg Pas
 s\n\nPerformance Portability Challenges for Fortran Applications\n\nHsu, N
 eill, Schoonover, Jibben, Carlson...\n\nThis project investigates how diff
 erent approaches to parallel optimization impact the performance portabili
 ty for Fortran codes. In addition, we explore the productivity challenges 
 due to the software tool-chain limitations unique to Fortran. For this stu
 dy, we build upon the Truchas software, a metal casting manufacturing simu
 lation code based on unstructured mesh methods and our initial efforts for
  accelerating two key routines, the gradient and mimetic finite difference
  calculations. The acceleration methods include OpenMP, for CPU multi-thre
 ading and GPU offloading, and CUDA for GPU offloading. Through this study,
  we find that the best optimization approach is dependent on the prioritie
 s of performance versus effort and the architectures that are targeted. CU
 DA is the most attractive where performance is the main priority, whereas 
 the OpenMP on CPU and GPU approaches are preferable when emphasizing produ
 ctivity. Furthermore, OpenMP for the CPU is the most portable across archi
 tectures. OpenMP for CPU multi-threading yields 3%-5% of achievable perfor
 mance, whereas the GPU offloading generally results in roughly 74%-90% of 
 achievable performance. However, GPU offloading with OpenMP 4.5 results in
  roughly 5% peak performance for the mimetic finite difference algorithm, 
 suggesting further serial code optimization to tune this kernel. In genera
 l, these results imply low performance portability, below 10% as estimated
  by the Pennycook metric. Though these specific results are particular to 
 this application, we argue that this is typical of many current scientific
  HPC applications and highlights the hurdles we will need to overcome on t
 he path to exascale.
URL:https://sc18.supercomputing.org/presentation/?id=ws_p3hpc110&sess=sess
 145
END:VEVENT
END:VCALENDAR