Discamus continentiam augere, luxuriam coercere
Publications of Torsten Hoefler
Copyright Notice:

The documents distributed by this server have been provided by the contributing authors as a means to ensure timely dissemination of scholarly and technical work on a noncommercial basis. Copyright and all rights therein are maintained by the authors or by other copyright holders, notwithstanding that they have offered their works here electronically. It is understood that all persons copying this information will adhere to the terms and constraints invoked by each author's copyright. These works may not be reposted without the explicit permission of the copyright holder.

Martin Kuettler, Maksym Planeta, Jan Bierbaum, Carsten Weinhold, Hermann Haertig, Amnon Barak, Torsten Hoefler:

 Corrected Trees for Reliable Group Communication

(Feb. 2019, Accepted at The ACM Conference Principles and Practice of Parallel Programming 2019 (PPoPP'19))

Abstract

Driven by the ever-increasing performance demands of compute-intensive applications, supercomputing systems comprise more and more nodes. This growth places a significant burden on fast group communication primitives and also makes those systems more susceptible to failures of individual nodes. In this paper we present a two-phase fault-tolerant scheme for group communication. Using broadcast as an example, we provide a full-spectrum discussion of our approach, from a formal analysis to LogP-based simulations to a message-passing-based implementation running on a large cluster. Ultimately, we are able to reduce the complex problem of reliable and fault-tolerant collective group communication to a graph-theoretical renumbering problem. Both simulations and measurements show that our solution achieves a latency reduction of 50% with up to 6 times fewer messages sent in comparison to existing schemes.
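
To make the idea of a two-phase broadcast concrete, the following is a minimal toy simulation. It is not the algorithm from the paper, whose details are not given on this page. It rests on several assumptions made purely for illustration: phase one disseminates along a binomial tree rooted at rank 0, phase two lets every rank that holds the message forward it to its next `distance` successors on a ring, failed ranks silently drop everything, and the root never fails. The names `binomial_children`, `two_phase_broadcast`, and `distance` are hypothetical.

```python
def binomial_children(rank, n):
    """Children of `rank` in a binomial broadcast tree rooted at rank 0."""
    step = 1
    while step <= rank:          # smallest power of two greater than rank
        step *= 2
    children = []
    while rank + step < n:
        children.append(rank + step)
        step *= 2
    return children


def two_phase_broadcast(n, failed, distance=1):
    """Simulate a tree broadcast followed by a ring correction phase.

    Returns (ranks that hold the message at the end, messages sent).
    Illustrative assumption: failed ranks neither receive nor forward.
    """
    failed = set(failed)
    if 0 in failed:
        raise ValueError("this sketch assumes the root (rank 0) is alive")

    # Phase 1: dissemination along the binomial tree.
    reached = {0}
    messages = 0
    frontier = [0]
    while frontier:
        rank = frontier.pop()
        for child in binomial_children(rank, n):
            messages += 1        # the message is sent even if the child is dead
            if child not in failed:
                reached.add(child)
                frontier.append(child)

    # Phase 2: every rank reached in phase 1 sends to its ring successors.
    corrected = set(reached)
    for rank in sorted(reached):
        for offset in range(1, distance + 1):
            target = (rank + offset) % n
            messages += 1
            if target not in failed:
                corrected.add(target)
    return corrected, messages


if __name__ == "__main__":
    holders, sent = two_phase_broadcast(8, failed={1})
    print(sorted(holders), sent)
```

In this toy run with 8 ranks, the failure of rank 1 cuts ranks 3, 5, and 7 off from the tree, and the ring phase delivers the message to them from their live predecessors. Whether a given correction distance suffices depends on how ranks are numbered within the tree, which is at least in the spirit of the renumbering problem the abstract mentions.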


BibTeX

@inproceedings{,
  author={Martin Kuettler and Maksym Planeta and Jan Bierbaum and Carsten Weinhold and Hermann Haertig and Amnon Barak and Torsten Hoefler},
  title={{Corrected Trees for Reliable Group Communication}},
  year={2019},
  month={Feb.},
  note={Accepted at The ACM Conference Principles and Practice of Parallel Programming 2019 (PPoPP'19)},
  source={http://www.unixer.de/~htor/publications/},
}

© Torsten Hoefler