Domain knowledge-guided interpretive machine learning: formula discovery for the oxidation behavior of ferritic-martensitic steels in supercritical water
*Correspondence to: Prof. Tong-Yi Zhang, Materials Genome Institute, Shanghai University, 333 Nanchen Road, Shanghai 200444, China. E-mail:
A general formula with high generalization and accurate prediction power is highly desirable for science, technology and engineering. In addition to human beings, artificial intelligence algorithms show great promise for the discovery of formulas. In this study, we propose a domain knowledge-guided interpretive machine learning strategy and demonstrate it by studying the oxidation behavior of ferritic-martensitic steels in supercritical water. The oxidation Cr equivalent is, for the first time, proposed in the present work to represent all contributions of alloying elements to oxidation, derived by our domain knowledge and interpretive machine learning algorithms. An open-source tree classifier for linear regression algorithm is also, for the first time, developed to materialize the formula with collected data. This algorithm effectively captures the linear correlation between compositions, testing environments and oxidation behaviors from the data. The sure independence screening and sparsifying operator algorithm finally assembles the information derived from the tree classifier for linear regression algorithm, resulting in a general formula. The general formula with the determined parameters has the power to predict, quantitatively and accurately, the oxidation behavior of ferritic-martensitic steels with multiple alloying elements exposed to various supercritical water environments, thereby providing guidance for the design of anti-oxidation steels and hence promoting the development of power plants with improved safety. The present work demonstrates the power of domain knowledge-guided interpretive machine learning with respect to the data-driven discovery of physics-informed formulas and the acceleration of materials informatics development.
The rapid development of materials informatics [1-5], artificial intelligence (AI) and machine learning (ML) techniques has led to a new paradigm of data-driven discovery of novel materials, state-of-the-art experimental and computational methods and scientific laws and formulas. The number of publications on materials informatics has increased exponentially in the past decade and materials informatics has achieved great success in many areas [6-10]. For example, Xue et al.  proposed an adaptive design iteration strategy by tightly coupling ML with experiments, which sequentially identifies the next experiments by using efficient global optimization to balance the trade-off between exploitation and exploration. This adaptive design strategy, also known as active learning, starts from an initial dataset of 22 alloys, runs nine feedback loops in the search space of
Attia et al. [13] developed an ML methodology to efficiently optimize a parameter space specifying the current and voltage profiles of six-step, 10-min fast-charging protocols for maximizing battery cycle life. They trained an elastic net ML model to predict battery cycle life using data only from the first few cycles and employed a Bayesian optimization algorithm to reduce the number of experiments by balancing exploration and exploitation when probing the parameter space of charging protocols. With this approach, they identified and validated high-cycle-life charging protocols among 224 candidates in 16 days. Saito et al. [14] used U-Net, a convolutional encoder-decoder network, to segment optical microscopy images and identify the thickness of atomic layer flakes, achieving a success rate of 70–80% in distinguishing monolayer and bilayer MoS2 and graphene.
ML is achieving remarkable success in materials science and engineering [15, 16] and will achieve even greater success if it can become more transparent and interpretable. In principle, AI and ML are based on statistics and do not draw on scientific laws, principles or (physical) equations, so most AI and ML algorithms perform as "black-box" systems [17-23]. Considerable efforts, such as physics-informed neural networks [24], symbolic regression and Shapley additive explanations (SHAP) [25], are being made to enhance the interpretability of ML models. Nevertheless, significant further work is required to make ML models interpretable. The strategy proposed in the present work, i.e., domain knowledge-guided interpretive ML, might pave the way for the discovery of mathematical formulas.
In the present work, we propose a domain knowledge-guided interpretive ML strategy to make ML models interpretable and physically meaningful, and we apply this strategy to the data-driven discovery of formulas for the oxidation of ferritic-martensitic (FM) steels in supercritical water (SCW). Although the use of SCW in power plants can achieve enhanced thermal efficiency with simplified plant design and improved safety, it requires materials with high oxidation resistance because SCW is a strong oxidant [26] beyond the critical point (374 ℃ and 22.1 MPa). FM steels are among the most promising structural materials for use in SCW-cooled power plants, owing to their high elevated-temperature strength, high creep resistance, high thermal conductivity, low swelling under irradiation, low thermal expansion coefficient, and low susceptibility to stress oxidation cracking up to 600 ℃ [27, 28].
The oxidation behavior of FM steels in SCW environments has been investigated extensively through experimental approaches [29-34]. The current understanding of the corrosion occurring in high-temperature water environments is associated with the chemistry and physics of the water (density and dielectric constant of the medium). In high-temperature water with a low density/dielectric constant (
Significant progress has been achieved in the investigation and understanding of FM steel oxidation in various SCW environments [29, 30, 37-39], as evidenced by the Arrhenius equation of
Oxidation is clearly a thermally activated process, and the associated thermodynamics and kinetics are greatly dependent on the material compositions and environmental variables. In experimental investigations, individual researchers adjust only one or a few experimental conditions and the obtained result and formula are valid only for the FM steels and SCW environments and periods tested. To the best of our knowledge, no generalized formula has been established for the description and/or prediction of the oxidation of FM steels with any given alloying elements exposed to various SCW environmental conditions. The present work adopts domain knowledge-guided interpretive ML to discover a generalized formula for the oxidation of FM steels in SCW, which will promote the development of green and safe power plants. In addition to exposure time, the investigated FM steels cover 11 alloying elements, and the studied SCW environments include temperature, dissolved oxygen concentration (DOC) and pressure.
Our domain knowledge of oxidation suggests a dimensionless Arrhenius equation of
The generalized formula established in this study accurately predicts the oxidation behavior of experimental FM steels with different alloying elements in various SCW testing conditions. Figure 1 outlines the domain knowledge-guided interpretive ML strategy, where the hub is the domain knowledge suggested Arrhenius equation. The feature importance of SHAP
Figure 1. Domain knowledge-guided interpretive ML strategy. A: Feature selection with
A total of 184 oxidation data of FM steels in SCW are collected from the literature and given at the online Supplementary Information. Every datum in the FM steel oxidation (FMO) dataset includes oxidation caused weight gain in units of mg/dm
Fifteen features of FM steel oxidation data
|Category|Feature|Description|
|---|---|---|
|Alloying elements|Cr|Chromium (wt.%)|
|Testing conditions|T|Absolute temperature (K)|
||Pressure|SCW pressure (MPa)|
||t|Exposure time (h)|
||DOC|Dissolved oxygen concentration (ppb)|
Xgboost and SHAP values
Xgboost [40] is a powerful tree-based boosting ensemble algorithm. The present work employs the Xgboost algorithm to regress the oxidation data of FM steels in SCW, and the values of the hyperparameters involved are optimized by cross-validation and/or a grid search in the hyperparameter space with the open-source Python library scikit-learn [43]. Table S1 in Section 2.1 of the Supplementary Information lists all optimized values of the hyperparameters.
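The grid search with cross-validation mentioned above can be illustrated with a minimal sketch. It is not the authors' pipeline: to stay runnable with only the Python standard library, a simple nearest-neighbour regressor stands in for Xgboost, the data are synthetic rather than the FMO dataset, and all function names and the grid values are hypothetical.

```python
import random
from itertools import product

def k_fold_indices(n_samples, n_folds, seed=0):
    """Shuffle sample indices and deal them into n_folds nearly equal folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::n_folds] for i in range(n_folds)]

def knn_predict(train_x, train_y, x, k):
    """Stand-in regressor (NOT Xgboost): mean response of the k nearest points."""
    order = sorted(range(len(train_x)), key=lambda i: abs(train_x[i] - x))
    nearest = order[:k]
    return sum(train_y[i] for i in nearest) / len(nearest)

def cv_mse(xs, ys, k, n_folds=10):
    """Mean squared error of the stand-in model under n_folds-fold CV."""
    sq_err, count = 0.0, 0
    for held_out in k_fold_indices(len(xs), n_folds):
        held = set(held_out)
        train_x = [xs[i] for i in range(len(xs)) if i not in held]
        train_y = [ys[i] for i in range(len(xs)) if i not in held]
        for i in held_out:
            sq_err += (knn_predict(train_x, train_y, xs[i], k) - ys[i]) ** 2
            count += 1
    return sq_err / count

def grid_search(xs, ys, grid, n_folds=10):
    """Return (best_params, best_score) over every combination in the grid."""
    names = list(grid)
    best = None
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        score = cv_mse(xs, ys, n_folds=n_folds, **params)
        if best is None or score < best[1]:
            best = (params, score)
    return best

# Synthetic 1D data: a smooth trend plus noise (hypothetical, not the FMO data).
rng = random.Random(1)
xs = [i / 10 for i in range(60)]
ys = [x ** 2 + rng.gauss(0, 0.5) for x in xs]
best_params, best_score = grid_search(xs, ys, {"k": [1, 3, 5, 9, 15]})
print(best_params, round(best_score, 3))
```

With a real boosting model, the same loop would range over its own hyperparameters; the only design choice shown here is that every candidate is scored on the same ten folds so that the scores are comparable.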
The SHAP algorithm is developed based on game theory . A SHAP value, whether positive or negative, reflects the contribution of a feature to a predicted response in one datum, and the predicted response is given by an ML model. In the present work, SHAP values are calculated with Xgboost models. If there are
where the SHAP value
Integration of SHAP values with domain knowledge
The game theory-based SHAP value is an additive feature attribution method, where the output is a sum of contributions of each input feature . If the contributions of variables to a function are not additive in the original variable space, but additive in a mapped space, the SHAP value will be calculated in the mapped space. For the oxidation of FM steels in SCW, there are 15 features and each feature contributes to the oxidation weight gain
It might be inaccurate to calculate the reasonable SHAP values in the original space. Based on the domain knowledge of oxidation, we take a dimensionless Arrhenius equation of
Feature ranking, selection and data screening
An Xgboost model is first developed with all features via ten-fold cross-validation (10-CV) and is used to calculate all SHAP values
The data screening is carried out by evaluating the errors
RESULTS AND DISCUSSION
Feature selection and data screening
A total of 184 data on FM steel oxidation in SCW are collected from the literature and provided in the Supplementary Information. Fifteen features are employed here and categorized into two groups, namely, alloying elements and testing conditions. The feature analysis is carried out within each of the groups. The SHAP values
Figure 2. SHAP analysis of features. A-D: SHAP values of testing conditions, i.e., temperature, time, DOC and pressure. E-I: SHAP values of alloy compositions, i.e., V, Si, Cr, Ni and Mn. J: Feature importance ranking by
As expected, Figure 2A shows that the lower the value of
The SHAP value of
It is somewhat surprising that vanadium plays the most important role among the studied 11 alloying elements in the oxidation of FM steels in SCW, as shown in Figure 2E, where the SHAP value of
The feature importance defined in the SHAP method (see Methods),
The feature selection is then conducted by the sequential backward selector wrapped with Xgboost and 10-CV, which yields the three testing features of
Pure SHAP values and oxidation Cr equivalent
The SHAP value of each feature is decomposed into its pure SHAP value and the interaction SHAP values, as stated in Eq. (1b). Figure 3A-H shows the pure SHAP values of the selected eight features and there are 178 pure SHAP values in each figure. A comparison of Figure 3A-H with the corresponding Figure 2A-C and Figure 2E-I indicates that, for a given feature value, the scatter of the pure SHAP values is much smaller than that of the SHAP values. This is an expected result because a pure SHAP value eliminates all interaction SHAP values from its parent SHAP value. If pure SHAP values are calculated from a perfect model of a single function of variables, the pure SHAP value of a feature will correspond to the feature value one-to-one, i.e., for one feature value, there is only one pure SHAP value.
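The decomposition described above can be made concrete with a small sketch. It assumes that the pure SHAP value corresponds to the diagonal (main-effect) term of the SHAP interaction values as defined for tree ensembles [44], i.e., the SHAP value of a feature minus its pairwise interaction terms. The toy model, the baseline-replacement value function and all names are hypothetical, and the enumeration is exact but only feasible for a handful of features.

```python
from itertools import combinations
from math import factorial

def coalition_value(model, x, baseline, present):
    """Evaluate the model with absent features replaced by baseline values."""
    mixed = [x[i] if i in present else baseline[i] for i in range(len(x))]
    return model(mixed)

def shapley_values(model, x, baseline):
    """Exact Shapley value of every feature by enumerating all coalitions."""
    m = len(x)
    phi = [0.0] * m
    for i in range(m):
        others = [j for j in range(m) if j != i]
        for size in range(m):
            weight = factorial(size) * factorial(m - size - 1) / factorial(m)
            for subset in combinations(others, size):
                s = set(subset)
                gain = (coalition_value(model, x, baseline, s | {i})
                        - coalition_value(model, x, baseline, s))
                phi[i] += weight * gain
    return phi

def interaction_matrix(model, x, baseline):
    """Pairwise interaction values; the diagonal holds the pure (main) effects."""
    m = len(x)
    phi = shapley_values(model, x, baseline)
    mat = [[0.0] * m for _ in range(m)]
    for i, j in combinations(range(m), 2):
        others = [k for k in range(m) if k not in (i, j)]
        total = 0.0
        for size in range(m - 1):
            weight = (factorial(size) * factorial(m - size - 2)
                      / (2 * factorial(m - 1)))
            for subset in combinations(others, size):
                s = set(subset)
                delta = (coalition_value(model, x, baseline, s | {i, j})
                         - coalition_value(model, x, baseline, s | {i})
                         - coalition_value(model, x, baseline, s | {j})
                         + coalition_value(model, x, baseline, s))
                total += weight * delta
        mat[i][j] = mat[j][i] = total
    for i in range(m):
        # Pure value = SHAP value minus all interaction terms of that feature.
        mat[i][i] = phi[i] - sum(mat[i][j] for j in range(m) if j != i)
    return phi, mat

# Toy model (hypothetical): two additive terms plus one pairwise interaction.
toy = lambda v: v[0] + 2 * v[1] + 3 * v[0] * v[2]
phi, mat = interaction_matrix(toy, [1.0, 1.0, 1.0], [0.0, 0.0, 0.0])
print(phi)
print([mat[i][i] for i in range(3)])
```

In this toy case the SHAP value of the first feature mixes its own additive term with half of the interaction term, whereas its pure value recovers the additive term alone, which is why pure values scatter less for a fixed feature value.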
Figure 3. Pure SHAP value analysis of important features. A-C: Pure SHAP values of testing conditions, i.e., temperature, time and DOC. D-H: Pure SHAP values of alloy compositions, i.e., V, Ni, Cr, Si and Mn.
Figure 3A shows almost ideal pure SHAP values of feature
Several factors may lead to multiple pure SHAP values at a given feature value. The first might be experimental error, i.e., the scatter among repeated tests. The second might be the Xgboost model itself, which only approximates the response from the input feature data rather than reproducing a perfect function. The third might be the method used to calculate the SHAP values from a tree-based algorithm (Tree-Explainer model). Figure 3C shows that the pure SHAP values of the DOC feature can be expressed by a logarithmic function of
From the pure SHAP value of an individual feature, we defined the joint SHAP value of two features as
which measures the joint contribution of two features. In general,
Figure 4. Joint SHAP value of two features and the derived oxidation Cr equivalent concentration. A-D: Joint SHAP value of two features, i.e., Cr with Si, Mn, Ni and V, respectively. E: Predicted values of Xgboost model versus experimental values with the transferred four features. F: Pure SHAP analysis of oxidation Cr equivalent concentration feature.
Hereafter, we use one feature of the oxidation Cr equivalent concentration to replace the five element features. Thus, the total number of features is reduced to four, one chemical composition feature and three testing condition features. With the four features, the Xgboost model is retrained with 10-CV. The predictions on the 178 data are plotted in Figure 4E, showing a perfect fitting with
Activation energy and time exponents
The oxidation mechanism of FM steels in SCW is embodied in the activation energy and time exponent [29, 46], which are the coefficients of
The TCLR tree must be pruned; otherwise, many leaves contain only two data per leaf, which considerably degrades the model generalization. A threshold on the number of data might also be introduced to prune the tree, i.e., the amount of data in a leaf should not be smaller than a pre-set threshold (default minsize
The entire feature space is estimated from the regions of the four features, i.e.,
Figure 5. Data located on the leaf of TCLR. (A), (B) One passed leaf and one failed leaf on TCLR of activation energy. (C), (D) One passed leaf and one failed leaf on TCLR of time exponents.
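The leaf-level fitting can be sketched as follows. This is not the TCLR implementation: it assumes the textbook Arrhenius form ln w = ln A - Q/(RT) within a single leaf, and the minimum leaf size, the correlation threshold used to mark a leaf as passed, and the synthetic data are all hypothetical choices made for illustration.

```python
import math

R_GAS = 8.314  # universal gas constant, J/(mol K)

def linear_fit(xs, ys):
    """Ordinary least squares for y = slope * x + intercept, plus Pearson r."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    pearson_r = sxy / math.sqrt(sxx * syy)
    return slope, intercept, pearson_r

def leaf_activation_energy(temps_k, weight_gains, min_size=3, min_abs_r=0.95):
    """Fit ln(w) against 1/T in one leaf; return (Q in J/mol, r, passed).

    min_size and min_abs_r are hypothetical thresholds, not TCLR defaults.
    """
    if len(temps_k) < min_size:
        return None, None, False
    inv_t = [1.0 / t for t in temps_k]
    ln_w = [math.log(w) for w in weight_gains]
    slope, _, r = linear_fit(inv_t, ln_w)
    # The slope of ln(w) versus 1/T equals -Q/R under the assumed Arrhenius form.
    return -slope * R_GAS, r, abs(r) >= min_abs_r

# Synthetic leaf: weight gains generated from an assumed Q of 150 kJ/mol.
true_q = 150e3
temps = [773.0, 798.0, 823.0, 848.0, 873.0]
gains = [1e10 * math.exp(-true_q / (R_GAS * t)) for t in temps]
q_fit, r_fit, passed = leaf_activation_energy(temps, gains)
print(round(q_fit / 1e3, 1), passed)
```

The time exponent could be estimated in the same way by regressing ln w against ln t within a leaf; a leaf with too few data or a weak linear correlation would be rejected rather than used to report a coefficient.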
Figure 5A shows one passed leaf for
The TCLR yields the values of activation energy Q and time exponent
The SISSO algorithm , with minimization of mean absolute percentage error (MAPE, see Supplementary Information) is carried out to find analytic expressions of activation energy Q and time exponent
Oxidation kinetic equations
As mentioned above, the pure SHAP values of the DOC feature suggest that the oxidation kinetics in logarithmic space take the form of
To have an analytic expression of
Putting all the analytic expressions together gives
The analytic formula of Eq. (5d) has strong predictive power, as shown in Figure 6, with a fitting performance of
Figure 6. Predicted values of Eq. (5d) versus experimentally measured values. Each dot represents an FM sample, and the dots on the dotted line indicate that the values predicted by the equation are consistent with experimental observations. The light blue region covers
In this study, we develop a domain knowledge-guided interpretive ML strategy and demonstrate it by the discovery of the generalized formula for FM steel oxidation in SCW. The domain knowledge suggests the generalized Arrhenius oxidation formula of
The oxidation chromium equivalent concentration
The developed TCLR algorithm is scientifically significant to materials informatics because it captures linear relationships between tasks and features. It is expected that when an original feature space is mapped to a high-dimensional space, the TCLR algorithm is able to capture linear relationships in the high-dimensional space. More efforts are needed to further develop the TCLR algorithm.
The generalized Arrhenius oxidation formula has very high prediction accuracy with a Pearson correlation coefficient
Performed the research, analysed data, wrote the programs, and drafted the manuscript: Cao B
Collected the data: Yang S, Sun A
Supervised the project: Zhang TY, Dong Z
Studied and revised the manuscript more on the oxidation mechanism: Dong Z
Designed the study, performed the research, analysed data, revised, and finalized the manuscript: Zhang TY
Discussed the results: Cao B, Yang S, Sun A, Dong Z, Zhang TY
Availability of data and materials
All experimental data collected in the study are contained in the Supplementary Information (FM steel oxidation dataset) and are also available at: https://github.com/Bin-Cao/TCLRmodel. The ML methodology described in the present work was implemented in Python. The source code of the programs and algorithms is available at: https://github.com/Bin-Cao/TCLRmodel.
Financial support and sponsorship
This work was sponsored by the National Key Research and Development Program of China (No. 2018YFB0704400), Key Program of Science and Technology of Yunnan Province (No. 202002AB080001-2), Key Research Project of Zhejiang Laboratory (No. 2021PE0AC02), and Shanghai Pujiang Program (Grant No. 20PJ1403700).
Conflicts of interest
We declare that we do not have any commercial or associative interest that represents a conflict of interest in connection with the work submitted.
Ethical approval and consent to participate
Consent for publication
© The Author(s) 2022.
1. Wei QH, Xiong J, Sun S, Zhang T-Y. Multi-objective machine learning of four mechanical properties of steels. Sci Sin -Tech 2021;51:722-36.
2. Xiong J, Zhang T-Y, Shi S. Machine learning of mechanical properties of steels. Sci China Technol Sci 2020;63:1247-55.
3. Leitherer A, Ziletti A, Ghiringhelli LM. Robust recognition and exploratory analysis of crystal structures via Bayesian deep learning. Nat Commun 2021;12:6234.
4. Sun S, Ouyang R, Zhang B, Zhang T-Y. Data-driven discovery of formulas by symbolic regression. MRS Bull 2019;44:559-64.
5. Xiong J, Shi S, Zhang T-Y. Machine learning of phases and mechanical properties in complex concentrated alloys. Journal of Materials Science & Technology 2021;87:133-42.
6. Xie SR, Quan Y, Hire AC, et al. Machine learning of superconducting critical temperature from Eliashberg theory. npj Comput Mater 2022;8.
7. Levämäki H, Tasnádi F, Sangiovanni DG, Johnson LJS, Armiento R, Abrikosov IA. Predicting elastic properties of hard-coating alloys using ab-initio and machine learning methods. npj Comput Mater 2022;8.
8. Roy Chowdhury P, Ruan X. Unexpected thermal conductivity enhancement in aperiodic superlattices discovered using active machine learning. npj Comput Mater 2022;8.
9. Zhu YQ, Xu T, Wei Q, et al. Linear-superelastic Ti-Nb nanocomposite alloys with ultralow modulus via high-throughput phase-field design and machine learning. npj Comput Mater 2021;7.
10. Wang JH, Jia J, Sun S, Zhang T-Y. Statistical learning of small data with domain knowledge-sample size-and pre-notch length- dependent strength of concrete. Engineering Fracture Mechanics 2022;259:108160.
11. Xue D, Balachandran PV, Hogden J, Theiler J, Xue D, Lookman T. Accelerated search for materials with targeted properties by adaptive design. Nat Commun 2016;7:11241.
12. Fung V, Hu G, Ganesh P, Sumpter BG. Machine learned features from density of states for accurate adsorption energy prediction. Nat Commun 2021;12:88.
13. Attia PM, Grover A, Jin N, et al. Closed-loop optimization of fast-charging protocols for batteries with machine learning. Nature 2020;578:397-402.
14. Saito Y, Shin K, Terayama K, et al. Deep-learning-based quality filtering of mechanically exfoliated 2D crystals. npj Comput Mater 2019;5.
15. Li X, Zhao J, Cong J, et al. Machine learning guided automatic recognition of crystal boundaries in bainitic/martensitic alloy and relationship between boundary types and ductile-to-brittle transition behavior. Journal of Materials Science & Technology 2021;84:49-58.
16. Dai F, Wen B, Sun Y, Xiang H, Zhou Y. Theoretical prediction on thermal and mechanical properties of high entropy (Zr0.2Hf0.2Ti0.2Nb0.2Ta0.2)C by deep learning potential. Journal of Materials Science & Technology 2020;43:168-74.
18. Lookman T, Balachandran PV, Xue D, Yuan R. Active learning in materials science with emphasis on adaptive sampling using uncertainties for targeted design. npj Comput Mater 2019;5.
19. Wen C, Zhang Y, Wang C, et al. Machine learning assisted design of high entropy alloys with desired property. Acta Materialia 2019;170:109-17.
20. Balachandran PV, Kowalski B, Sehirlioglu A, Lookman T. Experimental search for high-temperature ferroelectric perovskites guided by two-step machine learning. Nat Commun 2018;9:1668.
21. Yan L, Diao Y, Lang Z, Gao K. Corrosion rate prediction and influencing factors evaluation of low-alloy steels in marine atmosphere using machine learning approach. Sci Technol Adv Mater 2020;21:359-70.
22. Jablonka KM, Jothiappan GM, Wang S, Smit B, Yoo B. Bias free multiobjective active learning for materials design and discovery. Nat Commun 2021;12:2312.
23. Garrido Torres JA, Gharakhanyan V, Artrith N, Eegholm TH, Urban A. Augmenting zero-Kelvin quantum mechanics with machine learning for the prediction of chemical reactions at high temperatures. Nat Commun 2021;12:7012.
24. Lu L, Meng X, Mao Z, Karniadakis GE. DeepXDE: A deep learning library for solving differential equations. SIAM Rev 2021;63:208-28.
25. Lundberg SM, Lee SI. A unified approach to interpreting model predictions. In: Proceedings of the 31st International Conference on Neural Information Processing Systems; 2017. pp. 4768-77. Available from: https://proceedings.neurips.cc/paper/2017/hash/8a20a8621978632d76c43dfd28b67767-Abstract.html [Last accessed on 20 Apr 2022].
26. Schulenberg T, Leung LK, Oka Y. Review of R & D for supercritical water cooled reactors. Progress in Nuclear Energy 2014;77:282-99.
27. Zhong X, Wu X, Han E. Effects of exposure temperature and time on corrosion behavior of a ferritic-martensitic steel P92 in aerated supercritical water. Corrosion Science 2015;90:511-21.
28. Klueh R, Nelson A. Ferritic/martensitic steels for next-generation reactors. Journal of Nuclear Materials 2007;371:37-52.
29. Ampornrat P, Was GS. Oxidation of ferritic-martensitic alloys T91, HCM12A and HT-9 in supercritical water. Journal of Nuclear Materials 2007;371:1-17.
30. Li Y, Xu T, Wang S, et al. Modelling and Analysis of the Corrosion Characteristics of Ferritic-Martensitic Steels in Supercritical Water. Materials (Basel) 2019;12:409.
31. Tan L, Ren X, Allen T. Corrosion behavior of 9-12% Cr ferritic-martensitic steels in supercritical water. Corrosion Science 2010;52:1520-8.
32. Li H, Cao Q, Zhu Z. High temperature oxidation behavior of ferritic steel in supercritical water at 550-700 ℃. Materials at High Temperatures 2018;36:111-6.
33. Zhu Z, Xu H, Jiang D, Mao X, Zhang N. Influence of temperature on the oxidation behaviour of a ferritic-martensitic steel in supercritical water. Corrosion Science 2016;113:172-9.
34. Li Y, Wang S, Sun P, et al. Investigation on early formation and evolution of oxide scales on ferritic-martensitic steels in supercritical water. Corrosion Science 2018;135:136-46.
35. Liu Z. Corrosion behavior of designed ferritic-martensitic steels in supercritical water. Canada: ProQuest Dissertations Publishing; 2013.
36. Dong Z, Li M, Behnamian Y, et al. Effects of Si, Mn on the corrosion behavior of ferritic-martensitic steels in supercritical water (SCW) environments. Corrosion Science 2020;166:108432.
37. Sun L, Yan W. Estimation of oxidation kinetics and oxide scale void position of ferritic-martensitic steels in supercritical water. Advances in Materials Science and Engineering 2017;2017:1-12.
38. Bischoff J, Motta AT. Oxidation behavior of ferritic-martensitic and ODS steels in supercritical water. Journal of Nuclear Materials 2012;424:261-76.
39. Zhang N, Xu H, Li B, Bai Y, Liu D. Influence of the dissolved oxygen content on corrosion of the ferritic-martensitic steel P92 in supercritical water. Corrosion Science 2012;56:123-8.
40. Chen T, Guestrin C. XGBoost: a scalable tree boosting system. In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining; 2016.
41. Ouyang R, Curtarolo S, Ahmetcik E, Scheffler M, Ghiringhelli LM. SISSO: a compressed-sensing method for identifying the best low-dimensional descriptor in an immensity of offered candidates. Phys Rev Materials 2018;2.
42. Zhang TY, Cao B, Zhang SY, Sun S. Tree-classifier for linear regression software [No. 2021SR1951267], 2021. Available from: https://register.ccopyright.com.cn/ [Last accessed on 20 Apr 2022].
43. Pedregosa F, Varoquaux G, Gramfort A, et al. Scikit-learn: machine learning in Python. J Mach Learn Res 2011;12:2825-30. Available from: https://www.jmlr.org/papers/volume12/pedregosa11a/pedregosa11a.pdf [Last accessed on 20 Apr 2022].
44. Lundberg SM, Erion GG, Lee SI. Consistent individualized feature attribution for tree ensembles. arXiv preprint 2018; arXiv:1802.03888.
45. Uusitalo M, Vuoristo P, Mäntylä T. High temperature corrosion of coatings and boiler steels below chlorine-containing salt deposits. Corrosion Science 2004;46:1311-31.
Cite This Article
Cao B, Yang S, Sun A, Dong Z, Zhang TY. Domain knowledge-guided interpretive machine learning: formula discovery for the oxidation behavior of ferritic-martensitic steels in supercritical water. J Mater Inf 2022;2:4. http://dx.doi.org/10.20517/jmi.2022.04