Figure 5. Data sampling and model construction for the Ma-PCF dataset. (A) Exploration of Ma-PCF data subsets and algorithm combinations. Each white dot represents a data subset and its best algorithm combination, and the red dot marks the model derived from the optimal data subset and algorithm combination. The black, blue, and red lines represent the baseline model’s 10-fold cross-validation accuracy, the data elimination process, and the data addition process, respectively; (B) Performance of six algorithms, shown as errors on the validation and test sets; (C) Confusion matrices for the best-generalizing model, CBC, on the training, validation, and test sets; (D) ROC curve for the CBC model and the corresponding AUC values for each category. CBC: gradient boosting classification; ROC: receiver operating characteristic; AUC: area under the curve.