Publications of Torsten Hoefler  
K. B. Ferreira, P. Widener, S. Levy, D. Arnold, Torsten Hoefler:
 
Understanding the Effects of Communication and Coordination on Checkpointing at Scale
(presented in New Orleans, LA, USA, Nov. 2014, Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis (SC14))
 
Abstract
    Fault-tolerance poses a major challenge for future
    large-scale systems. Active research into coordinated, uncoordinated, and hybrid checkpointing systems has explored how the
    introduction of asynchrony can address anticipated scalability
    issues. However, few insights into selection and tuning of these
    protocols for applications at scale have emerged. In this paper, we
    use a simulation-based approach to show that local checkpoint
    activity in resilience mechanisms can significantly affect the
    performance of key workloads, even when less than 1% of a
    local node’s compute time is allocated to resilience mechanisms
    (a very generous assumption). Specifically, we show that even
    though much work on uncoordinated checkpointing has focused
    on optimizing message log volumes, local checkpointing activity
    may dominate the overheads of this technique at scale. Our
    study shows that local checkpoints lead to process delays that
    can propagate through messaging relations to other processes
    causing a cascading series of delays. We demonstrate how to tune
    hierarchical uncoordinated checkpointing protocols designed to
    reduce log volumes to significantly reduce these synchronization
    overheads at scale. Our work provides a critical analysis and
    comparison of coordinated and uncoordinated checkpointing and
    enables users and system administrators to fine-tune the checkpointing scheme to the application and system characteristics.
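
The cascading delays described in the abstract can be illustrated with a toy model. The sketch below is not the simulator used in the paper; it assumes a hypothetical ring of processes in which each rank computes for a fixed time per step and then waits for a message from its left neighbour. The names `simulate` and `ckpt_at`, and all parameter values, are illustrative assumptions.

```python
def simulate(num_procs, num_steps, work, ckpt_at):
    """Toy bulk-synchronous ring (an assumption, not the paper's model).

    ckpt_at maps (rank, step) -> duration of a local checkpoint taken by
    that rank during that step. Returns the finish time of every rank.
    """
    finish = [0.0] * num_procs
    for step in range(num_steps):
        # Local phase: compute, plus an optional local checkpoint.
        done = [
            finish[r] + work + ckpt_at.get((r, step), 0.0)
            for r in range(num_procs)
        ]
        # Communication phase: rank r cannot proceed until the message
        # from its left neighbour (r - 1) has been sent.
        finish = [
            max(done[r], done[(r - 1) % num_procs])
            for r in range(num_procs)
        ]
    return finish


if __name__ == "__main__":
    baseline = simulate(8, 4, 1.0, {})
    delayed = simulate(8, 4, 1.0, {(0, 0): 0.5})
    late = sum(1 for b, d in zip(baseline, delayed) if d > b)
    print(f"ranks delayed after 4 steps: {late} of 8")
```

In this toy model, a single local checkpoint on rank 0 delays one additional rank per step, so after enough steps every rank finishes late by the full checkpoint duration even though only one rank did any resilience work.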
 
BibTeX
@inproceedings{uncoordinated-cr-communication,
  author={K. B. Ferreira and P. Widener and S. Levy and D. Arnold and Torsten Hoefler},
  title={{Understanding the Effects of Communication and Coordination on Checkpointing at Scale}},
  year={2014},
  month={Nov.},
  location={New Orleans, LA, USA},
  note={Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis (SC14)},
  source={http://www.unixer.de/~htor/publications/},
}