It is anticipated that in future generations of massively parallel computer systems a significant portion of processors may suffer from hardware or software faults, rendering large-scale computations useless. In this work we address this problem from the algorithmic side, proposing resilient algorithms that can recover from such faults irrespective of their origin. In particular, we set the foundations of a new class of algorithms that combine numerical approximations with machine learning methods. To this end, we consider three fault scenarios: (1) a gappy region with no previous gaps and no contamination of the surrounding simulation data, (2) a space-time gappy region with full spatiotemporal information and no contamination, and (3) previous gaps with contamination of the surrounding data. To recover from such faults we employ different reconstruction and simulation methods, namely projective integration, co-Kriging interpolation, and resimulation. To compare the effectiveness of these methods for the different processor faults and to quantify the error propagation in each case, we perform simulations of two benchmark flows: flow in a cavity and flow past a circular cylinder. In general, projective integration appears to be the most effective method when the time gaps are small, resimulation is the best method when the time gaps are large, and the performance of co-Kriging is independent of the time gap. Furthermore, projective integration and co-Kriging are found to be good estimation methods for the initial and boundary conditions of the resimulation method in scenario (3).
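To illustrate the idea behind projective integration, the following is a minimal sketch under stated assumptions, not the implementation used in this work. It takes a few fine-scale steps, estimates the time derivative from the last two states, and extrapolates linearly across a time gap. The model problem (a linear decay equation standing in for the flow solver) and the names `inner_step`, `projective_step`, `n_inner`, and `gap` are hypothetical choices made only for this illustration.

```python
import numpy as np


def inner_step(u, dt, rate=1.0):
    """Hypothetical fine-scale solver: one forward-Euler step of du/dt = -rate * u."""
    return u + dt * (-rate * u)


def projective_step(u, dt, n_inner, gap):
    """Take n_inner (>= 1) fine steps, then extrapolate linearly across `gap`.

    The slope is a finite-difference estimate from the last two fine-scale
    states; the extrapolation is the "projective" forward-Euler step.
    """
    if n_inner < 1:
        raise ValueError("n_inner must be at least 1")
    prev = u
    for _ in range(n_inner):
        prev, u = u, inner_step(u, dt)
    slope = (u - prev) / dt
    return u + gap * slope


# Assumed setup: two values decaying from t = 0, with a time gap of 0.05
# that starts after five fine steps of size 0.01.
u0 = np.array([1.0, 2.0])
u_filled = projective_step(u0, dt=0.01, n_inner=5, gap=0.05)
u_exact = u0 * np.exp(-0.1)  # exact solution at t = 5 * 0.01 + 0.05
print(np.max(np.abs(u_filled - u_exact)))
```

In this toy setting the extrapolation error grows with the size of `gap`, which is consistent with the observation above that projective integration is most effective when the time gaps are small.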