BEGIN:VCALENDAR
VERSION:2.0
PRODID:Linklings LLC
BEGIN:VTIMEZONE
TZID:America/Chicago
X-LIC-LOCATION:America/Chicago
BEGIN:DAYLIGHT
TZOFFSETFROM:-0600
TZOFFSETTO:-0500
TZNAME:CDT
DTSTART:19700308T020000
RRULE:FREQ=YEARLY;BYMONTH=3;BYDAY=2SU
END:DAYLIGHT
BEGIN:STANDARD
TZOFFSETFROM:-0500
TZOFFSETTO:-0600
TZNAME:CST
DTSTART:19701101T020000
RRULE:FREQ=YEARLY;BYMONTH=11;BYDAY=1SU
END:STANDARD
END:VTIMEZONE
BEGIN:VEVENT
DTSTAMP:20181221T160728Z
LOCATION:D166
DTSTART;TZID=America/Chicago:20181112T120000
DTEND;TZID=America/Chicago:20181112T123000
UID:submissions.supercomputing.org_SC18_sess173_ws_espm106@linklings.com
SUMMARY:Automatic Generation of High-Order Finite-Difference Code with Tem
 poral Blocking for Extreme-Scale Many-Core Systems
DESCRIPTION:Workshop\nAccelerators, Exascale, Parallel Programming Languag
 es, Libraries, and Models, Workshop Reg Pass\n\nAutomatic Generation of Hi
 gh-Order Finite-Difference Code with Temporal Blocking for Extreme-Scale M
 any-Core Systems\n\nTanaka, Ishihara, Sakamoto, Nakamura, Kimura...\n\nIn
  this paper we describe the basic idea, implementation, and achieved
  performance of Formura, our DSL for stencil computation, on systems
  based on the PEZY-SC2 many-core processor. From a high-level
  description of the differential equation and a simple description of
  the finite-difference stencil, Formura generates the entire
  simulation code, including MPI parallelization with overlapped
  communication and calculation, advanced temporal blocking, and
  parallelization for many-core processors. The achieved performance is
  4.78 PF, or 21.5% of the theoretical peak performance, for an explicit
  scheme for compressible CFD with fourth-order accuracy in space and
  third-order accuracy in time. For a slightly modified implementation
  of the same scheme, efficiency was slightly lower (17.5%), but the
  actual calculation per timestep was 25% faster. Temporal blocking
  improved performance by up to 70%. Even though the B/F number of
  PEZY-SC2 is low, around 0.02, we achieved efficiency comparable to
  that of highly optimized CFD codes on machines with much higher
  memory bandwidth, such as the K computer. We have demonstrated that
  automatic generation of code with temporal blocking is quite an
  effective way to use very large-scale machines with low memory
  bandwidth for large-scale CFD calculations.
URL:https://sc18.supercomputing.org/presentation/?id=ws_espm106&sess=sess1
 73
END:VEVENT
END:VCALENDAR

