Showing 1 - 10 of 11 for search: '"Nuria Losada"'
Published in:
IEEE Transactions on Parallel and Distributed Systems. 33:1856-1872
Author:
Keita Teranishi, Patricia González, George Bosilca, Aurelien Bouteiller, Nuria Losada, María Martín
Published in:
RUC: Repositorio da Universidade da Coruña
Universidade da Coruña (UDC)
[Abstract] The growth in the number of computational resources used by high-performance computing (HPC) systems leads to an increase in failure rates. Fault-tolerant techniques will become essential for long-running applications executing in future e
Published in:
FTXS@SC
With the increase in scale and architectural complexity of supercomputers, the management of failures has become integral to successfully executing a long-running high-performance computing application. In many instances, failures have a localized sc
Published in:
RUC. Repositorio da Universidade da Coruña
Universitat Oberta de Catalunya (UOC)
[Abstract] Heterogeneous systems have increased their popularity in recent years due to the high performance and reduced energy consumption capabilities provided by using devices such as GPUs or Xeon Phi accelerators. This paper proposes a checkpoint
Published in:
RUC: Repositorio da Universidade da Coruña
Universidade da Coruña (UDC)
[Abstract] The resilience approach generally used in high-performance computing (HPC) relies on coordinated checkpoint/restart, a global rollback of all the processes that are running the application. However, in many instances, the failure has a mor
External link:
https://explore.openaire.eu/search/publication?articleId=doi_dedup___::011b4fcf909698e01b3ce01e9af17ca5
http://hdl.handle.net/2183/27584
Published in:
ICCS
As parallel machines increase their number of processors, so does the failure rate of the global system, thus, long-running applications will need to make use of fault tolerance techniques to ensure the successful execution completion. Most of curren
Published in:
HPCS
Current petascale systems, formed by hundreds of thousands of cores, are highly dynamic, which causes that hardware failure rates are relatively high. Failure data collected from two large high-performance computing sites have been analysed in [1], s
Published in:
UPCommons. Portal del coneixement obert de la UPC
Universitat Politècnica de Catalunya (UPC)
FTXS@SC
Recercat. Dipósit de la Recerca de Catalunya
The coming exascale era is a great opportunity for high performance computing (HPC) applications. However, high failure rates on these systems will hazard the successful completion of their execution. Bit-flip errors in dynamic random access memory (
External link:
https://explore.openaire.eu/search/publication?articleId=doi_dedup___::576ce18667b1ef900caf9daaef5cbc86
Published in:
RUC. Repositorio da Universidade da Coruña
This is a post-peer-review, pre-copyedit version of an article published in Journal of Supercomputing. The final authenticated version is available online at: https://doi.org/10.1007/s11227-016-1629-7 [Abstract] Future exascale systems, formed by mil
External link:
https://explore.openaire.eu/search/publication?articleId=doi_dedup___::69b8e09994e9f5efc41650d7cc2ca7d4
http://hdl.handle.net/2183/20890
Published in:
PDP
Despite the increasing popularity of shared-memory systems, there is a lack of tools for providing fault tolerance support to shared-memory applications. Checkpointing is one of the most popular fault tolerance techniques. However, checkpointing co