Using Machine Learning at Scale in Numerical Simulations with SmartSim: An application to ocean climate modeling
Paper Resources
- Link: https://doi.org/10.1016/j.jocs.2022.101707
- Code: https://github.com/CrayLabs/NCAR_ML_EKE
- Dataset: https://doi.org/10.5281/zenodo.4682270
Highlights
- Improves a large-scale, realistic simulation with online machine-learning inference
- Introduces a software framework for using machine learning in numerical simulations
- Cross-language (Fortran, C, C++, Python) support for machine-learning interfaces
- Uses a distributed, in-memory database (Redis) for scalable data transfer and ML inference
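The data flow behind the last highlight can be illustrated with a minimal sketch: a simulation rank writes its input features to a shared store, asks the store to run a model on them, and reads the prediction back. Everything here is an assumption for illustration. `InMemoryStore`, `toy_model`, `simulation_step`, and the key names are hypothetical stand-ins that run in a single process; they are not the SmartSim client API, and the doubling "model" is a placeholder for a trained network.

```python
from typing import Callable, Dict, List

Tensor = List[float]


class InMemoryStore:
    """Hypothetical single-process stand-in for a shared in-memory database."""

    def __init__(self) -> None:
        self._tensors: Dict[str, Tensor] = {}
        self._models: Dict[str, Callable[[Tensor], Tensor]] = {}

    def put_tensor(self, key: str, data: Tensor) -> None:
        # Store a copy so later changes by the caller do not leak in.
        self._tensors[key] = list(data)

    def get_tensor(self, key: str) -> Tensor:
        return list(self._tensors[key])

    def set_model(self, name: str, fn: Callable[[Tensor], Tensor]) -> None:
        self._models[name] = fn

    def run_model(self, name: str, input_key: str, output_key: str) -> None:
        # Inference happens next to the data; only keys cross the interface.
        self._tensors[output_key] = self._models[name](self._tensors[input_key])


def toy_model(features: Tensor) -> Tensor:
    # Placeholder for a trained network: doubles each input feature.
    return [2.0 * x for x in features]


def simulation_step(store: InMemoryStore, rank: int, step: int,
                    features: Tensor) -> Tensor:
    # Unique keys per rank and timestep keep concurrent writers apart.
    in_key = f"features_rank{rank}_step{step}"
    out_key = f"eke_rank{rank}_step{step}"
    store.put_tensor(in_key, features)
    store.run_model("toy_eke", in_key, out_key)
    return store.get_tensor(out_key)


store = InMemoryStore()
store.set_model("toy_eke", toy_model)
result = simulation_step(store, rank=0, step=1, features=[0.5, 1.5])
print(result)
```

In this pattern the simulation code never loads the model itself. It only exchanges named tensors with the store, which is what lets clients written in different languages share one model.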
Abstract
We demonstrate the first climate-scale numerical ocean simulations improved through distributed, online inference of deep neural networks (DNNs) using SmartSim. SmartSim is a library dedicated to enabling online analysis and machine learning (ML) for high-performance numerical simulations. In this paper, we detail the SmartSim architecture and provide benchmarks, including online inference with a shared ML model, EKE-ResNet, on heterogeneous HPC systems. We demonstrate the capability of SmartSim by using it to run a 12-member ensemble of global-scale, high-resolution ocean simulations, each spanning 19 compute nodes, all communicating with the same ML architecture at each simulation timestep. Running the ensemble for a total of 120 simulated years serves 970 billion inferences. The inferences predict the oceanic eddy kinetic energy (EKE), a variable used to tune different turbulence closures in the model, which therefore directly affects the simulation. The root-mean-square error in EKE (relative to an eddy-resolving simulation) is 20% lower with the ML prediction than with the previous state of the art. This demonstration shows how machine-learning methods can be integrated into traditional numerical simulations, replace prognostic equations, and preserve overall simulation stability without significantly affecting the time to solution.
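The figures quoted in the abstract can be put in proportion with some back-of-the-envelope arithmetic, and the error metric can be written out explicitly. The sketch below is illustrative only. The per-member numbers assume the 120 simulated years and the inferences are split evenly across the 12 members, which the abstract does not state, and the node count covers simulation nodes only. The `rmse` and `relative_reduction` helpers are hypothetical, and the EKE values are arbitrary toy numbers chosen so that the reduction works out to 20%; they are not data from the paper.

```python
import math
from typing import Sequence

# Figures taken from the abstract.
ENSEMBLE_MEMBERS = 12
NODES_PER_MEMBER = 19
TOTAL_INFERENCES = 970e9
TOTAL_SIMULATED_YEARS = 120

# Simulation nodes only; any database nodes are not counted here.
total_sim_nodes = ENSEMBLE_MEMBERS * NODES_PER_MEMBER

# Assumption: years and inferences are divided evenly among members.
years_per_member = TOTAL_SIMULATED_YEARS / ENSEMBLE_MEMBERS
inferences_per_member = TOTAL_INFERENCES / ENSEMBLE_MEMBERS
inferences_per_sim_year = TOTAL_INFERENCES / TOTAL_SIMULATED_YEARS


def rmse(predicted: Sequence[float], reference: Sequence[float]) -> float:
    """Root-mean-square error between two equally sized sequences."""
    if len(predicted) != len(reference) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [(p - r) ** 2 for p, r in zip(predicted, reference)]
    return math.sqrt(sum(squared) / len(squared))


def relative_reduction(new_error: float, baseline_error: float) -> float:
    """Fraction by which new_error is lower than baseline_error."""
    return 1.0 - new_error / baseline_error


# Toy values (not from the paper): a reference field and two estimates of it.
reference_eke = [1.0, 2.0, 3.0]
baseline_eke = [2.0, 3.0, 4.0]   # each value off by 1.0
ml_eke = [1.8, 2.8, 3.8]         # each value off by 0.8

baseline_rmse = rmse(baseline_eke, reference_eke)
ml_rmse = rmse(ml_eke, reference_eke)
reduction = relative_reduction(ml_rmse, baseline_rmse)

print(f"simulation nodes: {total_sim_nodes}")
print(f"simulated years per member (even split assumed): {years_per_member:.0f}")
print(f"inferences per simulated year: {inferences_per_sim_year:.3e}")
print(f"toy RMSE reduction: {reduction:.0%}")
```

Under the even-split assumption, each member runs 10 simulated years and the ensemble averages roughly 8 billion inferences per simulated year.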