<html dir="ltr">

<head>

<meta http-equiv="Content-Type" content="text/html; charset=Windows-1252">

<style type="text/css" id="owaParaStyle"></style>

</head>

<body fpstyle="1" ocsi="0">

<div style="direction: ltr;font-family: Tahoma;color: #000000;font-size: 10pt;">

<div><b>Along with the other DSN workshops, we have extended the paper submission deadline </b></div>

<div><b>by one week to March 14, 2014.  Author notification has likewise moved to April 11, 2014.</b></div>

<div><br>

</div>

<div>CALL FOR PAPERS</div>

<div>4th International Workshop on Fault-Tolerance for HPC at Extreme Scale </div>

<div>(FTXS 2014)</div>

<div><br>

</div>

<div>In conjunction with</div>

<div>The 44th Annual IEEE/IFIP International Conference on </div>

<div>Dependable Systems and Networks (DSN 2014)</div>

<div>Atlanta, Georgia, USA on June 23-26, 2014</div>

<div><br>

</div>

<div>WORKSHOP MOTIVATION</div>

<div>For the HPC community, a new scaling in numbers of processing elements</div>

<div>has superseded the historical trend of Moore's Law scaling in processor</div>

<div>frequencies. This progression from single core to multi-core and</div>

<div>many-core will be further complicated by the community's imminent</div>

<div>migration from traditional homogeneous architectures to ones that are</div>

<div>heterogeneous in nature. As a consequence of these trends, the HPC</div>

<div>community is facing rapid increases in the number, variety, and</div>

<div>complexity of components, and must thus overcome increases in aggregate</div>

<div>fault rates, fault diversity, and the complexity of isolating root causes.</div>

<div><br>

</div>

<div>Recent analyses demonstrate that HPC systems experience simultaneous</div>

<div>(often correlated) failures. In addition, statistical analyses suggest</div>

<div>that silent soft errors cannot be ignored anymore, because the increase</div>

<div>in components, memory size and data paths (including networks) makes the</div>

<div>probability of silent data corruption (SDC) non-negligible. The HPC</div>

<div>community has serious concerns regarding this issue and application</div>

<div>users are less confident that they can rely on a correct answer to their</div>

<div>computations. Other studies have indicated a growing divergence between</div>

<div>failure rates experienced by applications and rates seen by the system</div>

<div>hardware and software. At Exascale, some scenarios project failure rates</div>

<div>reaching one failure per hour. This conflicts with the current</div>

<div>checkpointing approach to fault tolerance that requires up to 30 minutes</div>

<div>to restart a parallel execution on the largest systems.  Lastly,</div>

<div>stabilization periods for the largest systems are already significant,</div>

<div>and the possibility that these could increase in length is of great</div>

<div>concern.  During the Approaching Exascale report at SC11, DOE program</div>

<div>managers identified resilience as a black swan - the most difficult</div>

<div>under-addressed issue facing HPC.</div>

<div><br>

</div>

<div>OPEN QUESTIONS</div>

<div>What does the fault-tolerance community need to do in order to be</div>

<div>prepared to face the challenges of extreme scale computing? What is</div>

<div>needed to keep applications with billions of threads of parallelism up</div>

<div>and running on systems that fail tens of times per day? As models</div>

<div>predict less than 50% efficiency of traditional checkpoint/restart</div>

<div>methods on future systems, are we ready to pay the cost of full</div>

<div>redundancy, effectively performing redundant multi-threading (RMT)</div>

<div>across entire systems? Do we even have the infrastructure necessary to</div>

<div>implement an RMT strategy?</div>

<div><br>

</div>

<div>How is the supercomputing community going to efficiently isolate</div>

<div>failures on enormously complex systems? Is it realistic to understand</div>

<div>these systems in such a way that some failures could be predicted with</div>

<div>enough accuracy and lead time to trigger useful failure avoidance</div>

<div>actions? What can the community do to protect applications from SDC in</div>

<div>memory and logic? To what extent should users and programmers be</div>

<div>involved in managing faults? What are the most promising self-healing</div>

<div>numerical methods?  Is there an emerging framework for fault management</div>

<div>at extreme scale?</div>

<div><br>

</div>

<div>GOALS</div>

<div>The goals of this workshop are to consider these complex questions, to</div>

<div>discuss the unique limitations that extreme scale and complexity impose</div>

<div>on traditional methods of fault-tolerance, and to explore new strategies</div>

<div>for dealing with those challenges.</div>

<div><br>

</div>

<div>PAPER SUBMISSIONS</div>

<div>Submissions are solicited in the following categories:</div>

<div>* Regular papers presenting innovative ideas improving the state of the art.</div>

<div>* Experience papers discussing the issues seen on existing extreme-scale </div>

<div>  systems, including some form of analysis and evaluation.</div>

<div>* Extended abstracts proposing disruptive ideas in the field, including </div>

<div>  some form of preliminary results.</div>

<div><br>

</div>

<div>Submissions shall be sent electronically, must conform to IEEE</div>

<div>conference proceedings style and should not exceed six pages including</div>

<div>all text, appendices, and figures.</div>

<div><br>

</div>

<div>TOPICS</div>

<div>Assuming hardware and software errors will be inescapable at extreme</div>

<div>scale, this workshop will consider aspects of fault tolerance peculiar</div>

<div>to extreme scale that include, but are not limited to: </div>

<div>* Quantitative assessments of cost in terms of power, performance, and </div>

<div>  resource impacts of fault-tolerant techniques, such as checkpoint</div>

<div>  restart, that are redundant in space, time or information</div>

<div>* Novel fault-tolerance techniques and implementations of emerging</div>

<div>  hardware and software technologies that guard against silent data</div>

<div>  corruption (SDC) in memory, logic, and storage and provide end-to-end</div>

<div>  data integrity for running applications</div>

<div>* Studies of hardware/software</div>

<div>  tradeoffs in error detection, failure prediction, error preemption, and</div>

<div>  recovery</div>

<div>* Advances in monitoring, analysis, and control of highly complex systems</div>

<div>* Highly scalable fault-tolerant programming models</div>

<div>* Metrics and standards for measuring, improving and enforcing the need </div>

<div>  for and effectiveness of fault-tolerance</div>

<div>* Failure modeling and scalable methods of reliability, availability, </div>

<div>  performability and failure prediction for fault-tolerant HPC systems</div>

<div>* Scalable Byzantine fault tolerance and security from single-fault and </div>

<div>  fail-silent violations</div>

<div>* Benchmarks and experimental environments, including fault-injection </div>

<div>  and accelerated lifetime testing, for evaluating performance of </div>

<div>  resilience techniques under stress</div>

<div>* Frameworks and APIs for fault tolerance and fault management.</div>

<div><br>

</div>

<div>IMPORTANT DATES</div>

<div>Submission of papers: March 14, 2014</div>

<div>Author notification: April 11, 2014</div>

<div>Camera-ready papers: April 2014</div>

<div>Workshop: June 23, 2014</div>

<div><br>

</div>

<div>WORKSHOP ORGANIZERS</div>

<div>Nathan DeBardeleben – Los Alamos National Laboratory</div>

<div>Franck Cappello – Argonne National Laboratory and the University of</div>

<div>  Illinois at Urbana-Champaign</div>

<div>Robert Clay – Sandia National Laboratories</div>

<div><br>

</div>

<div>PROGRAM COMMITTEE</div>

<div>Rob Aulwes – Los Alamos National Laboratory</div>

<div>Aurélien Bouteiller – University of Tennessee Knoxville</div>

<div>Greg Bronevetsky – Lawrence Livermore National Laboratory</div>

<div>John Daly – Department of Defense</div>

<div>Christian Engelmann – Oak Ridge National Laboratory</div>

<div>Kurt Ferreira – Sandia National Laboratories</div>

<div>Ana Gainaru – University of Illinois at Urbana-Champaign</div>

<div>Leonardo Bautista Gomez – Tokyo Institute of Technology</div>

<div>Hideyuki Jitsumoto – The University of Tokyo</div>

<div>Zhiling Lan – Illinois Institute of Technology</div>

<div>Naoya Maruyama – RIKEN Advanced Institute for Computational Science</div>

<div>Kathryn Mohror – Lawrence Livermore National Laboratory</div>

<div>Bogdan Nicolae – IBM Research – Ireland</div>

<div>Rolf Riesen – IBM Research – Ireland</div>

<div>Yves Robert – ENS Lyon</div>

<div>Thomas Ropars – EPFL</div>

<div>Stephen Scott – Tennessee Tech University and Oak Ridge National</div>

<div>Laboratory</div>

<div>Vilas Sridharan – AMD, Inc.</div>

<div>Abhinav Vishnu – Pacific Northwest National Laboratory</div>

<div>Roel Wuyts – Intel ExaScience Lab</div>

<div><br>

</div>

<div>MORE INFORMATION</div>

<div>See https://sites.google.com/site/ftxsworkshop/home/ftxs2014 and</div>

<div>http://2014.dsn.org/ for more information.</div>


</div>

</body>

</html>