BEGIN:VCALENDAR
VERSION:2.0
PRODID:Linklings LLC
BEGIN:VTIMEZONE
TZID:America/Chicago
X-LIC-LOCATION:America/Chicago
BEGIN:DAYLIGHT
TZOFFSETFROM:-0600
TZOFFSETTO:-0500
TZNAME:CDT
DTSTART:19700308T020000
RRULE:FREQ=YEARLY;BYMONTH=3;BYDAY=2SU
END:DAYLIGHT
BEGIN:STANDARD
TZOFFSETFROM:-0500
TZOFFSETTO:-0600
TZNAME:CST
DTSTART:19701101T020000
RRULE:FREQ=YEARLY;BYMONTH=11;BYDAY=1SU
END:STANDARD
END:VTIMEZONE
BEGIN:VEVENT
DTSTAMP:20181221T160727Z
LOCATION:D167/174
DTSTART;TZID=America/Chicago:20181112T103000
DTEND;TZID=America/Chicago:20181112T110000
UID:submissions.supercomputing.org_SC18_sess151_ws_mlhpce110@linklings.com
SUMMARY:Communication-Efficient Parallelization Strategy for Deep Convolut
 ional Neural Network Training
DESCRIPTION:Workshop\nDeep Learning\, Machine Learning\, Workshop Reg Pa
 ss\n\nCommunication-Efficient Parallelization Strategy for Deep Convolu
 tional Neural Network Training\n\nLee\, Agrawal\, Balaprakash\, Choudhar
 y\, Liao\n\nTraining modern Convolutional Neural Network (CNN) models i
 s extremely time-consuming\, and the efficiency of its parallelization p
 lays a key role in finishing the training in a reasonable amount of tim
 e. The well-known parallel synchronous Stochastic Gradient Descent (SGD
 ) algorithm suffers from the high costs of inter-process communication a
 nd synchronization. To address these problems\, the asynchronous SGD al
 gorithm employs a master-slave model for parameter updates. However\, i
 t can result in a poor convergence rate due to gradient staleness. In a
 ddition\, the master-slave model is not scalable when running on a larg
 e number of compute nodes. In this paper\, we present a communication-e
 fficient gradient averaging algorithm for synchronous SGD\, which adopt
 s a few design strategies to maximize the degree of overlap between com
 putation and communication. Our time complexity analysis shows that th
 e algorithm outperforms traditional algorithms that use MPI allreduce-b
 ased communication. Training two popular deep CNN models\, VGG-16 and R
 esNet-50\, on the ImageNet dataset\, our experiments on Cori Phase-I\, a
 Cray XC40 supercomputer at NERSC\, show that our algorithm achieves up t
 o a 2516.36x speedup for VGG-16 and a 2734.25x speedup for ResNet-50 wh
 en running on up to 8192 cores.
URL:https://sc18.supercomputing.org/presentation/?id=ws_mlhpce110&sess=ses
 s151
END:VEVENT
END:VCALENDAR