[Hpcresilience] DOE Fault Management Workshop Report - Posted

Debardeleben, Nathan A ndebard at lanl.gov
Fri Jan 25 13:42:23 MST 2013


http://science.energy.gov/~/media/ascr/pdf/program-documents/docs/FaultManagement-wrkshpRpt-v4-final.pdf

The workshop was held in June 2012; the report was written around August 2012 and was officially stamped and approved on the ASCR site in December 2012.

The workshop was held on June 6, 2012, at the BWI Airport Marriott hotel in Maryland. Its goals were to:

  1.  Describe the required HPC resilience for critical DOE mission needs
  2.  Detail what HPC resilience research is already being done at the DOE national laboratories and is expected to be done by industry or other groups
  3.  Determine what fault management research is a priority for DOE’s Office of Science and National Nuclear Security Administration (NNSA) over the next five years
  4.  Develop a roadmap for getting the necessary research accomplished in the timeframe when it will be needed by the large computing facilities across DOE

-- Nathan

----------------------------------------------------
  Nathan DeBardeleben, Ph.D.
  Los Alamos National Laboratory
  High Perf. Computing Systems Integration (HPC-5)
  Ultra-Scale Research Center, Resilience Lead
  phone: 505-412-1069
  email: ndebard at lanl.gov
----------------------------------------------------