<html dir="ltr">

<head>

<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">

<style type="text/css" id="owaParaStyle"></style>

</head>

<body fpstyle="1" ocsi="0">

<div style="direction: ltr;font-family: Tahoma;color: #000000;font-size: 10pt;">

<div>CALL FOR PAPERS</div>

<div>3rd International Workshop on </div>

<div>Fault-Tolerance for HPC at Extreme Scale (FTXS 2013)</div>

<div><br>

</div>

<div>In conjunction with</div>

<div>The 22nd International ACM Symposium on </div>

<div>High Performance Parallel and Distributed Computing (HPDC 2013)</div>

<div>New York City, New York, USA on June 17-21, 2013</div>

<div><br>

</div>

<div>WORKSHOP MOTIVATION</div>

<div>For the HPC community, a new scaling in numbers of processing elements</div>

<div>has superseded the historical trend of Moore's Law scaling in</div>

<div>processor frequencies. This progression from single core to multi-core</div>

<div>and many-core will be further complicated by the community's imminent</div>

<div>migration from traditional homogeneous architectures to ones that are</div>

<div>heterogeneous in nature. As a consequence of these trends, the HPC</div>

<div>community is facing rapid increases in the number, variety, and</div>

<div>complexity of components, and must thus overcome increases in</div>

<div>aggregate fault rates, fault diversity, and complexity of isolating</div>

<div>root cause.</div>

<div><br>

</div>

<div>Recent analyses demonstrate that HPC systems experience simultaneous</div>

<div>(often correlated) failures. In addition, statistical analyses suggest</div>

<div>that silent soft errors cannot be ignored anymore, because the</div>

<div>increase of components, memory size and data paths (including</div>

<div>networks) makes the probability of silent data corruption (SDC)</div>

<div>non-negligible. The HPC community has serious concerns regarding this</div>

<div>issue and application users are less confident that they can rely on a</div>

<div>correct answer to their computations. Other studies have indicated a</div>

<div>growing divergence between failure rates experienced by applications</div>

<div>and rates seen by the system hardware and software. At Exascale, some</div>

<div>scenarios project failure rates reaching one failure per hour. This</div>

<div>conflicts with the current checkpointing approach to fault tolerance</div>

<div>that requires up to 30 minutes to restart a parallel execution on the</div>

<div>largest systems.  Lastly, stabilization periods for the largest</div>

<div>systems are already significant, and the possibility that these could</div>

<div>increase in length is of great concern.  During the Approaching</div>

<div>Exascale report at SC11, DOE program managers identified resilience</div>

<div>as a black swan - the most difficult under-addressed issue facing HPC.</div>

<div><br>

</div>

<div>OPEN QUESTIONS</div>

<div>What does the fault-tolerance community need to do in order to be</div>

<div>prepared to face the challenges of extreme scale computing? What is</div>

<div>needed to keep applications with billions of threads of parallelism up</div>

<div>and running on systems that fail tens of times per day? As models</div>

<div>predict less than 50% efficiency of traditional checkpoint/restart</div>

<div>methods on future systems, are we ready to pay the cost of full</div>

<div>redundancy, effectively performing redundant multi-threading (RMT)</div>

<div>across entire systems? Do we even have the infrastructure necessary to</div>

<div>implement an RMT strategy?</div>

<div><br>

</div>

<div>How is the supercomputing community going to efficiently isolate</div>

<div>failures on enormously complex systems? Is there any chance to</div>

<div>understand these systems in such a way that some failure could be</div>

<div>predicted with enough accuracy and anticipation to trigger useful</div>

<div>failure avoidance actions? What can the community do to protect</div>

<div>applications from SDC in memory and logic? How far should the user and</div>

<div>the programmer be involved in managing faults? What are the most</div>

<div>promising self-healing numerical methods?</div>

<div><br>

</div>

<div>GOALS</div>

<div>The goals of this workshop are to consider these complex questions, to</div>

<div>discuss the unique limitations that extreme scale and complexity</div>

<div>impose on traditional methods of fault-tolerance, and to explore new</div>

<div>strategies for dealing with those challenges.</div>

<div><br>

</div>

<div>PAPER SUBMISSIONS</div>

<div>Submissions are solicited in the following categories:</div>

<div>* Regular papers presenting innovative ideas improving the state of the art.</div>

<div>* Experience papers discussing the issues seen on existing extreme-scale</div>

<div> systems, including some form of analysis and evaluation.</div>

<div>* Extended abstracts proposing disruptive ideas in the field,</div>

<div> including some form of preliminary results.</div>

<div><br>

</div>

<div>Submissions shall be sent electronically, must conform to IEEE</div>

<div>conference proceedings style and should not exceed eight pages including</div>

<div>all text, appendices, and figures.</div>

<div><br>

</div>

<div>TOPICS</div>

<div>Assuming hardware and software errors will be inescapable at extreme</div>

<div>scale, this workshop will consider aspects of fault tolerance peculiar</div>

<div>to extreme scale that include, but are not limited to:</div>

<div>* Quantitative assessments of cost in terms of power, performance, and</div>

<div> resource impacts of fault-tolerant techniques, such as</div>

<div> checkpoint/restart, that are redundant in space, time or information</div>

<div>* Novel fault-tolerance techniques and implementations of emerging</div>

<div> hardware and software technologies that guard against silent data</div>

<div> corruption (SDC) in memory, logic, and storage and provide</div>

<div> end-to-end data integrity for running applications</div>

<div>* Studies of hardware/software tradeoffs in error detection, failure</div>

<div> prediction, error preemption, and recovery</div>

<div>* Advances in monitoring, analysis, and control of highly complex systems</div>

<div>* Highly scalable fault-tolerant programming models</div>

<div>* Metrics and standards for measuring, improving and enforcing the</div>

<div> need for and effectiveness of fault-tolerance</div>

<div>* Failure modeling and scalable methods of reliability, availability,</div>

<div> performability and failure prediction for fault-tolerant HPC</div>

<div> systems</div>

<div>* Scalable Byzantine fault tolerance and security from single-fault</div>

<div> and fail-silent violations</div>

<div>* Benchmarks and experimental environments, including fault-injection</div>

<div> and accelerated lifetime testing, for evaluating performance of</div>

<div> resilience techniques under stress</div>

<div><br>

</div>

<div>IMPORTANT DATES</div>

<div>Submission of papers: February 11th, 2013</div>

<div>Author notification: March 18th, 2013</div>

<div>Camera ready papers: April 15th, 2013</div>

<div>Workshop: June 17th or June 18th, 2013</div>

<div><br>

</div>

<div>WORKSHOP ORGANIZERS</div>

<div>Nathan DeBardeleben - Los Alamos National Laboratory</div>

<div>Jon Stearley - Sandia National Laboratories</div>

<div>Franck Cappello - INRIA &amp; University of Illinois at Urbana-Champaign</div>

<div><br>

</div>

<div>PROGRAM COMMITTEE</div>

<div>Rob Aulwes - Los Alamos National Laboratory</div>

<div>Clayton Chandler - Department of Defense</div>

<div>Robert Clay - Sandia National Laboratories</div>

<div>John Daly - Department of Defense </div>

<div>Christian Engelmann - Oak Ridge National Laboratory</div>

<div>Felix Salfner - SAP Innovation Center Potsdam </div>

<div>Kurt Ferreira - Sandia National Laboratories</div>

<div>Ana Gainaru - University of Illinois at Urbana-Champaign</div>

<div>Leonardo Bautista Gomez - Tokyo Institute of Technology</div>

<div>Hideyuki Jitsumoto - The University of Tokyo</div>

<div>Rakesh Kumar - University of Illinois, Urbana-Champaign </div>

<div>Zhiling Lan - Illinois Institute of Technology</div>

<div>Naoya Maruyama - Tokyo Institute of Technology</div>

<div>Kathryn Mohror - Lawrence Livermore National Laboratory</div>

<div>Rolf Riesen - IBM Research - Ireland</div>

<div>Yves Robert - ENS Lyon</div>

<div><br>

</div>

<div>See http://institute.lanl.gov/resilience/workshops/ftxs2013/ for </div>

<div>more information.</div>


</div>

</body>

</html>