Kinship classification from a machine learning perspective: a pilot study based on genotyping data

Fanzhang Lei; Xiaolian Wu; Qinglin Liu; Tong Xie; Bofeng Zhu

doi:10.20517/jtgg.2025.109


Original Article | Open Access | 1 Apr 2026



J Transl Genet Genom. 2026;10:119-39.

10.20517/jtgg.2025.109 | © The Author(s) 2026.


Abstract

Aim: Kinship analysis of trace and degraded biological samples has consistently posed a challenge in forensic practice. With shorter amplicons and no stutter peaks, Insertion/Deletion polymorphisms (InDels) significantly improve kinship analyses of deceased individuals and their potential living relatives. However, room for improvement remains in identifying 2nd-degree and more distant kinships. To address this issue, a kinship analysis workflow based on machine learning (ML) models was proposed.

Methods: Based on multiple kinship parameters, including identity-by-state (IBS) scores, k coefficients, proportion identity-by-descent (IBD), and likelihood ratio (LR) values, this pilot study applied a recently validated InDel panel to develop a preliminary ML workflow for forensic kinship multi-classification.

Results: In the binary classification of 2nd-degree relatives and unrelated pairs, the LR cutoff threshold workflow and the ML workflow achieved a similar accuracy of 0.9194. However, the ML method had a conclusiveness rate (CR) of 1.0, compared to 0.7066 for the LR workflow. In the multiclass task, the LR-based workflow had macro F1 scores of 0.6955 and 0.5212 and CRs of 0.7375 and 0.7046 for the single- and dual-threshold methods, respectively. In contrast, the optimal model-feature combination of the ML-based workflow (XGBoost-IBD+LR) classified all samples conclusively, with a macro F1 score of 0.9020.

Conclusion: In summary, the ML workflow enhanced the kinship analysis efficiency based on the InDel genotyping system by combining multiple parameters, aiming to provide a more flexible and efficient solution for large-scale database screening.


Keywords

Insertion/Deletion polymorphism, capillary electrophoresis, kinship classification, machine learning, population genetics


INTRODUCTION

Forensic kinship analysis is vital yet challenging for disaster victim identification and cold case investigation, particularly when dealing with highly degraded biomaterials. Although short tandem repeat typing by capillary electrophoresis is highly polymorphic and widely adopted as the standard approach for kinship analysis, its effectiveness is limited by longer amplicon lengths. Highly degraded biomaterials encountered at forensic scenes may result in incomplete or unreliable genotyping and artifacts. Characterized by short amplicons, a low mutation rate, and the absence of stutter peaks, the Insertion/Deletion polymorphism (InDel) has garnered increasing attention from forensic researchers^[1-3]. With small amplicons (< 230 bp) for its 57 selected autosomal InDel loci, the AGCU-60 InDel panel has been well validated for paternity testing in a series of studies based on numerous populations^[4-9], where a complete variant call could be obtained from samples with a minimum DNA input of 125 pg. However, InDel loci, as exemplified by this panel, face challenges in kinship analysis. When analyzing highly degraded samples, prior knowledge about the genetic background of the unknown donor is often unavailable. Consequently, the only feasible approach is to conduct large-scale database matching to identify potential relatives.

In this context, kinship parameters including the identity-by-state (IBS) score, the method of moments (MoM), and the likelihood ratio (LR) are commonly applied. The IBS score^[10], which assesses the similarity of DNA segments or alleles between two individuals, can be computed directly from the number of shared IBS alleles without a priori knowledge of allele frequencies or linkage disequilibrium status in the population. However, it has drawbacks such as the need for non-generalized discriminant thresholds and a relatively low strength of evidence. MoM, grounded in the observed genotype profiles of pairwise individuals, estimates parameters such as kinship coefficients and k coefficients^[11]. This approach simultaneously considers the IBS counts and (or) allele frequencies, making it more informative. However, no standardized evidence interpretation system has been established for this method so far, and it is less effective at identifying 3rd-degree and more distant kinships^[11,12]. The LR method, currently recommended by the International Society for Forensic Genetics^[13], requires an a priori statement of a specific kinship as the alternative hypothesis and of unrelatedness between the two individuals as the null hypothesis. The LR then compares the conditional probabilities of the observed genotypes under these two mutually exclusive hypotheses. Its robustness and efficiency have been validated in numerous studies^[9,14-16]. The mainstream LR cutoff threshold-based approach therefore provides kinship identification results based on two specific hypotheses: a particular type of kinship as the alternative hypothesis and unrelated individuals (UNs) as the null hypothesis. When highly degraded samples with unknown genetic backgrounds are analyzed based on InDels, all possible hypotheses must be evaluated, which results in the accumulation of errors in the final conclusions. Considering the inherent limitation imposed by the bi-allelic nature of InDel loci, this accumulation of errors may compromise the reliability of kinship analysis.

To address these challenges, this study proposes introducing machine learning (ML) models and their evaluation systems as a novel method for interpreting kinship evidence. By eliminating the need to assume a specific kinship, this approach aims to provide a more flexible and efficient solution for large-scale database screening. A series of ML classifiers of different complexities were introduced into InDel-based kinship classification, namely multinomial logistic regression (MLR), multinomial Naïve Bayes (MNB), support vector machine (SVM), random forest classifier (RFC), extreme gradient boosting (XGBoost, XGB)^[17], light gradient boosting machine (LightGBM, LGBM)^[18] and categorical boosting (CatBoost, CATB)^[11]. By comparison with the conventional workflow, this study aims to develop a preliminary ML workflow for kinship analysis based on the AGCU-60 InDel loci and serves as a pilot exploration of integrating multiple kinship parameters into forensic kinship classification.

METHODS

Sample collection and reference data preprocessing

A total of 175 Chinese Tibetans from the Tibetan Autonomous Prefecture of Gannan (CTG), who claimed to be unrelated within three generations, were added to the existing population dataset of AGCU-60 InDel loci. This study was approved by the Ethics Committee of Xi’an Jiaotong University Health Science Center and Southern Medical University (No. 2019-1039). Genotyping data of different populations were integrated. All CTG samples were collected between August 2022 and October 2023. Data sources were as follows: (1) the studied CTG group; (2) 1000 Genomes Project (1KGP) Phase III expanded^[19]; (3) nine populations from the previous studies on the AGCU-60 kit^[4-9,20]. When processing the 1KGP whole-genome data, we filtered out samples that may have Mendelian inheritance errors (e = 0.001) using the following PLINK v2.0 (https://www.cog-genomics.org/plink/2.0) command: “--me 0.001”.

DNA extraction, polymerase chain reaction amplification, and genotyping

Genomic DNAs from the bloodstain samples were extracted based on the Chelex-100 method. Then, the Polymerase Chain Reaction (PCR) amplification conditions of the AGCU-60 kit followed the manufacturer’s protocol and previous validation study^[5].

Statistical analysis of forensic parameters and genetic background

Common forensic parameters of the 57 autosomal InDel loci in the CTG group were calculated with STRAF v1.0.5^[21]. Linkage disequilibrium between all pairs of the 57 InDel loci was tested with Arlequin v3.5^[22]. Detailed information is provided in Figure 1 and Supplementary Tables 1 and 2. Using PLINK v2.0 (https://www.cog-genomics.org/plink/2.0), we retained samples unrelated within three generations with the command “--king-cutoff 0.0625” to obtain robust allele frequencies in the population. We then merged these samples separately into a global dataset (n = 5201) and an East Asian dataset (n = 3200) according to biogeographical origin. To assess the phylogenetic relationships among the 36 global populations, the merged .vcf files were used for the Treemix analysis (https://bioconda.github.io/recipes/treemix/README.html), with Yoruba in Ibadan, Nigeria (YRI) set as the root population. Finally, ADMIXTURE analyses (https://dalexander.github.io/admixture/) were performed with different numbers of assumed ancestry components (K = 2-10), and the optimal K value was determined by cross-validation. The East Asian dataset was used for the forensic kinship analyses and the development of ML models. Genotype data of the global dataset were used to calculate Nei’s Genetic Distance (D_A) values between paired populations. Principal component analyses (PCA) were performed separately on the allele frequency and genotype data of the global dataset. A locus-by-locus analysis of molecular variance was also performed to obtain the overall Fixation Index within Subpopulations (F_IS), Fixation Index among Subpopulations (F_ST), and Total Fixation Index (F_IT) values of the East Asian dataset, to reveal latent genetic substructure within East Asian populations.


Figure 1. Geographical distributions of the 35 global populations and the studied CTG group. Populations labeled as 1, 2, and 3 are, respectively, from datasets of this study, the 1000 Genomes Project Phase III, and previous AGCU-60 InDel kit studies. The map is open access at the site of the QGIS Geographic Information System (https://www.qgis.org/). Copyright © 1989, 1991 Free Software Foundation, Inc. QGIS: Quantum Geographic Information System; ACB: African Caribbean in Barbados; ASW: African Ancestry in Southwest US; ESN: Esan in Nigeria; GWD: Gambian in Western Division, The Gambia; LWK: Luhya in Webuye, Kenya; MSL: Mende in Sierra Leone; YRI: Yoruba in Ibadan, Nigeria; CLM: Colombian in Medellin, Colombia; MXL: Mexican Ancestry in Los Angeles, California; PEL: Peruvian in Lima, Peru; PUR: Puerto Rican in Puerto Rico; CDX: Chinese Dai in Xishuangbanna, China; CDXG: Chinese Dongxiang in Gansu, China; CHB: Han Chinese in Beijing, China; CHC: Chinese Han in Chengdu, China; CHG: Chinese Han in Guangdong, China; CHH: Chinese Han in Hunan, China; CHHN: Chinese Han in Hainan, China; CHS: Southern Han Chinese, China; CHY: Chinese Hani in Yunnan, China; CLH: Chinese Li in Hainan, China; CMY: Chinese Miao in Yunnan, China; CTG: Chinese Tibetan in Tibetan Autonomous Prefecture of Gannan, China; JPT: Japanese in Tokyo, Japan; KHV: Kinh in Ho Chi Minh City, Vietnam; SHP: Dingjie Sherpa, China; CEU: Utah residents with Northern and Western European ancestry; FIN: Finnish in Finland; GBR: British in England and Scotland; IBS: Iberian populations in Spain; TSI: Toscani in Italy; BEB: Bengali in Bangladesh; GIH: Gujarati Indian in Houston, TX; ITU: Indian Telugu in the UK; PJL: Punjabi in Lahore, Pakistan; STU: Sri Lankan Tamil in the UK.

Calculation of multiple kinship parameters

Figure 2 outlines the workflow for forensic kinship classification in this study. After confirming the absence of significant genetic substructure among 15 East Asian populations, 3,201 individuals from the merged dataset were used as input for pedigree simulations in Familias^[23] [Mutation options: model = equal probability (simple); rate = 0.0; range = 0.1; rate 2 = 1E-6]. Finally, 40,000 pairs with the following types of relatedness were simulated: (1) parent-offspring (PO); (2) full siblings (FS); (3) 2nd-degree relatives (2ND), including an equal number of half-siblings (HS), aunt-cousins (AC) and grandchild-grandparents (GP); (4) unrelated individuals (UN).

Figure 2. Workflow for the critical evaluation of forensic efficacy and kinship classification from the ML algorithm perspective. LR: Likelihood ratio; IBD: identity-by-descent; MLR: multinomial logistic regression; SVM: support vector machine; RFC: random forest classifier; ML: machine learning; LightGBM: light gradient boosting machine; PM: probability of matching; PD: power of discrimination; PE: probability of exclusion; PIC: polymorphism information content; TPI: typical paternity index; Ho: observed heterozygosity; He: expected heterozygosity; CPE: cumulative probability of exclusion; CPD: cumulative power of discrimination; Nei’s DA: Nei’s Genetic Distance; PCA: principal component analysis; t-SNE: t-distributed stochastic neighbor embedding; CV: cross-validation; ACC: accuracy; AUC: area under the curve; AMOVA: analysis of molecular variance; XGBoost: extreme gradient boosting.

For each pair in the East Asian dataset, the three k coefficients (k₀, k₁ and k₂), i.e., the probabilities that the two individuals share zero, one or two IBD alleles, were computed. Table 1 displays the expected values of k₀, k₁, and k₂ for different types of relatedness. Proportion identity-by-descent (IBD) values were derived from these k coefficients using PLINK v1.9. Mathematically, the proportion IBD value is equal to twice the kinship coefficient. Under the null and alternative hypotheses H₀ and H_i (i = 1, 2, 3), LR values for bi-allelic InDels were calculated to evaluate the underlying relationship between pairwise individuals (Eq. 1 in Supplementary Appendix A). The four hypotheses referring to UN, PO, FS, and 2ND relatedness are shown in Table 2. IBS scores for pairwise individuals were calculated according to Table 2. By multiplying the LR values and summing the IBS scores across all InDel loci in linkage equilibrium, the combined LR (CLR) and the cumulative IBS (CIBS) score were obtained (Eq. 2 in Supplementary Appendix A).

Table 1

Expected values of k coefficient and IBD proportion for different relatedness

Type of kinship	k₀	k₁	k₂	IBD proportion
Parent-offspring (PO)	0	1	0	1/2
Full siblings (FS)	1/4	1/2	1/4	1/2
2ND relatives*	1/2	1/2	0	1/4
Unrelated (UN)	1	0	0	0

* indicates kinships within 2nd-degree, including Half-sibling (HS), Aunt-cousin (AC), and Grandchild-grandparent (GP). IBD: Identity-by-descent.
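The relationship between the k coefficients, the proportion IBD, and the kinship coefficient in Table 1 can be checked with a few lines of Python. This is a minimal sketch rather than the authors' pipeline: it assumes the usual definition of the proportion IBD as k₁/2 + k₂ (the PI_HAT statistic reported by PLINK), and the function and dictionary names are illustrative.

```python
from fractions import Fraction

# Expected (k0, k1, k2) for each relationship, copied from Table 1.
EXPECTED_K = {
    "PO": (Fraction(0), Fraction(1), Fraction(0)),
    "FS": (Fraction(1, 4), Fraction(1, 2), Fraction(1, 4)),
    "2ND": (Fraction(1, 2), Fraction(1, 2), Fraction(0)),
    "UN": (Fraction(1), Fraction(0), Fraction(0)),
}


def proportion_ibd(k0, k1, k2):
    """Proportion IBD = k1/2 + k2 (assumed definition; k0 contributes nothing)."""
    if k0 + k1 + k2 != 1:
        raise ValueError("k coefficients must sum to 1")
    return k1 / 2 + k2


def kinship_coefficient(k0, k1, k2):
    """Kinship coefficient = half the proportion IBD, as stated in the text."""
    return proportion_ibd(k0, k1, k2) / 2


for name, ks in EXPECTED_K.items():
    print(name, proportion_ibd(*ks), kinship_coefficient(*ks))
```

With exact fractions, the computed proportions reproduce the last column of Table 1 (1/2 for PO and FS, 1/4 for 2ND relatives, 0 for UN).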

Table 2

Formulae of likelihood ratio (LR) and calculation of identity by state (IBS) score for kinship classification within 2ND-degree relatives based on different genotype pairs of biallelic InDels

Genotype 1/Genotype 2	LR (PO/UN)	LR (FS/UN)	LR (2ND*/UN)	IBS score
AA/AA	$$ \frac{1}{p} $$	$$ \frac{1}{4}+\frac{1}{2 p}+\frac{1}{4 p^{2}} $$	$$ \frac{1}{2}+\frac{1}{2 p} $$	2
AA/AB	$$ \frac{1}{2 p} $$	$$ \frac{1}{4}+\frac{1}{4 p} $$	$$ \frac{1}{2}+\frac{1}{4 p} $$	1
AB/AB	$$ \frac{1}{4 p q} $$	$$ \frac{1}{4}+\frac{1}{4 p q} $$	$$ \frac{1}{2}+\frac{1}{8 p q} $$	2
AA/BB	$$ \frac{\mathrm{e}}{p^{2} q^{2}} $$	$$ \frac{1}{4} $$	$$ \frac{1}{2} $$	0

p and q are the frequencies of the two alleles at each InDel locus, and the order of the genotype combination is assumed to have no impact on the result, that is, AA/BB and BB/AA are identical. Mendelian inheritance errors, such as genotyping errors and de novo mutations, were considered (e = 0.001)^[24,25]. * indicates kinships of 2nd-degree, including Half-sibling (HS), Aunt-cousin (AC), and Grandchild-grandparent (GP). IBD: Identity-by-descent; FS: full siblings; PO: parent-offspring; 2ND: 2nd-degree relatives; UN: unrelated individuals.
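The per-locus formulae in Table 2 and their combination across loci can be sketched as follows. This is an illustrative implementation, not the authors' code: genotypes are encoded as the strings "AA", "AB" and "BB", p is taken as the frequency of allele A, the formulae are applied symmetrically to BB/BB and BB/AB pairs (using q in place of p), and the AA/BB term for PO/UN is copied as printed in Table 2. All function names are hypothetical.

```python
import math

E = 0.001  # Mendelian-error term from the Table 2 footnote


def _norm(g):
    """Sort the alleles so that 'BA' and 'AB' are treated as the same genotype."""
    return "".join(sorted(g))


def ibs_score(g1, g2):
    """IBS score from Table 2: 2 for identical genotypes, 1 for hom/het, 0 for AA/BB."""
    g1, g2 = _norm(g1), _norm(g2)
    if g1 == g2:
        return 2
    return 1 if "AB" in (g1, g2) else 0


def locus_lr(g1, g2, p):
    """Per-locus LRs (PO/UN, FS/UN, 2ND/UN) following Table 2; p = frequency of allele A."""
    g1, g2 = _norm(g1), _norm(g2)
    q = 1.0 - p
    freq = {"A": p, "B": q}
    if g1 == g2 == "AB":
        return {"PO": 1 / (4 * p * q), "FS": 0.25 + 1 / (4 * p * q), "2ND": 0.5 + 1 / (8 * p * q)}
    if g1 == g2:  # AA/AA, or BB/BB by symmetry (assumption)
        f = freq[g1[0]]
        return {"PO": 1 / f, "FS": 0.25 + 1 / (2 * f) + 1 / (4 * f * f), "2ND": 0.5 + 1 / (2 * f)}
    if "AB" in (g1, g2):  # homozygote paired with heterozygote
        hom = g1 if g2 == "AB" else g2
        f = freq[hom[0]]
        return {"PO": 1 / (2 * f), "FS": 0.25 + 1 / (4 * f), "2ND": 0.5 + 1 / (4 * f)}
    # Opposite homozygotes (AA/BB)
    return {"PO": E / (p * p * q * q), "FS": 0.25, "2ND": 0.5}


def combine(loci):
    """loci: iterable of (g1, g2, p). Returns (log10 CLR per hypothesis, CIBS)."""
    log_clr = {"PO": 0.0, "FS": 0.0, "2ND": 0.0}
    cibs = 0
    for g1, g2, p in loci:
        for hyp, lr in locus_lr(g1, g2, p).items():
            log_clr[hyp] += math.log10(lr)  # multiplying LRs = summing their logs
        cibs += ibs_score(g1, g2)
    return log_clr, cibs


# Three made-up loci for one pair of individuals
log_clr, cibs = combine([("AA", "AB", 0.4), ("AB", "AB", 0.5), ("BB", "BB", 0.6)])
print(log_clr, cibs)
```

Summing log₁₀(LR) values instead of multiplying raw LRs avoids numerical underflow or overflow when many loci are combined, and yields the log₁₀ scale used for the cutoff thresholds.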

In total, we calculated 11 kinship parameters, including IBS₀-₂ (the numbers of loci at which the pair shares 0, 1, and 2 IBS alleles, respectively), CIBS, the k₀-₂ coefficients, proportion IBD, and LRs under three hypotheses. We then merged these parameters into 7 distinct feature combinations: (1) IBS; (2) IBD; (3) LR; (4) IBS+IBD; (5) IBS+LR; (6) IBD+LR; and (7) IBS+IBD+LR (ALL). For these feature combinations, we conducted t-distributed stochastic neighbor embedding (t-SNE) dimensionality reduction for visualization.
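The seven feature combinations are simply all non-empty subsets of the three parameter groups. A small sketch of how they could be enumerated is given below; the column names are hypothetical placeholders, not those used in the study.

```python
from itertools import combinations

# Hypothetical column names for the 11 kinship parameters, grouped by type.
FEATURE_GROUPS = {
    "IBS": ["IBS0", "IBS1", "IBS2", "CIBS"],
    "IBD": ["k0", "k1", "k2", "prop_IBD"],
    "LR": ["log10LR_PO", "log10LR_FS", "log10LR_2ND"],
}


def feature_combinations(groups):
    """Return {label: column list} for every non-empty subset of the groups."""
    names = list(groups)
    combos = {}
    for size in range(1, len(names) + 1):
        for subset in combinations(names, size):
            label = "+".join(subset)
            combos[label] = [col for g in subset for col in groups[g]]
    return combos


combos = feature_combinations(FEATURE_GROUPS)
for label, cols in combos.items():
    print(f"{label}: {len(cols)} features")
```

The labels come out in the same order as the list in the text (IBS, IBD, LR, IBS+IBD, IBS+LR, IBD+LR, IBS+IBD+LR), and the last one contains all 11 parameters.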

Common and machine learning workflows for forensic kinship analysis

The 11 kinship parameters from the 40,000 simulated pairs were split into a modeling dataset (80%) and a validation set (20%) for machine learning analysis. As the log₁₀(LR) threshold method is widely used in kinship identification, we used it as the reference methodology. The simplified decision-making process of the threshold-based classification is shown in Table 3. Based on the counts of true relationships corresponding to different LR values described in Table 3, we calculated the accuracy (ACC), conclusiveness rate (CR), and system power (SP). These three metrics respectively describe the accuracy of kinship inference when the method can draw a clear conclusion, the rate at which clear conclusions can be drawn, and the overall system performance. The specific calculation methods are detailed in Eq. 3-5 in Supplementary Appendix A. A brief procedure is shown in Figure 2.

Table 3

Decision-making process for traditional threshold-based kinship classification

Classification threshold	Prediction	True relationship: related	True relationship: unrelated
log₁₀(LR) ≥ T_i	The two individuals are related under a certain hypothesis*	K₁	U₁
T_i > log₁₀(LR) > T_e	Relatedness between the two individuals is not confirmed	K₂	U₂
log₁₀(LR) ≤ T_e	The two individuals are unrelated	K₃	U₃
Total		M	N

T_i and T_e are thresholds for identifying and excluding, and the single threshold is used when T_i and T_e are the same. * indicates different alternative hypotheses of FS and 2ND relatives, including Half-sibling (HS), Aunt-cousin (AC), and Grandchild-grandparent (GP). FS: full siblings; 2ND: 2nd-degree relatives; LR: likelihood ratio.
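The decision process in Table 3 and the three evaluation metrics can be sketched as below. Because Eq. 3-5 are given only in Supplementary Appendix A, the formulae here are assumptions inferred from the verbal definitions in the text: ACC is the share of correct calls among conclusive calls, CR is the share of conclusive calls among all pairs, and SP is the share of correct conclusive calls among all pairs. Function names are illustrative.

```python
from collections import Counter


def decide(log10_lr, t_i, t_e):
    """Three-way decision of Table 3; with t_i == t_e it reduces to a single cutoff."""
    if log10_lr >= t_i:
        return "related"
    if log10_lr <= t_e:
        return "unrelated"
    return "inconclusive"


def evaluate(log10_lrs, truly_related, t_i, t_e):
    """Tabulate K1..U3 and derive ACC, CR and SP (assumed definitions, see lead-in)."""
    counts = Counter(
        (decide(x, t_i, t_e), rel) for x, rel in zip(log10_lrs, truly_related)
    )
    k1, u1 = counts[("related", True)], counts[("related", False)]
    k3, u3 = counts[("unrelated", True)], counts[("unrelated", False)]
    total = sum(counts.values())       # M + N
    conclusive = k1 + u1 + k3 + u3     # everything outside the K2/U2 row
    correct = k1 + u3                  # related called related, unrelated called unrelated
    return {
        "ACC": correct / conclusive if conclusive else float("nan"),
        "CR": conclusive / total,
        "SP": correct / total,
    }


# Toy example with made-up log10(LR) values and a dual threshold of +/-1
metrics = evaluate(
    [3.0, 0.5, -2.0, 2.5, -3.0, -0.2],
    [True, True, True, False, False, False],
    t_i=1.0,
    t_e=-1.0,
)
print(metrics)
```

Under these assumed definitions SP equals ACC × CR, which makes explicit the trade-off between being accurate and being conclusive.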

Developing ML models for kinship classification based on LR values and different feature combinations

We developed a series of binary models based on log₁₀(LR) using Python v3.9.1, aiming to distinguish FS and 2ND relatives from UN pairs. After preprocessing the data, seven types of ML models (MLR, MNB, SVM, RFC, XGB, LGBM, and CATB) were developed. To balance bias and variance, the difference in accuracy between the training and test sets was limited to within 0.05^[24,25]. The best model for each algorithm was evaluated using 10-fold cross-validation, comparing training time and balanced ACC. The top binary models for FS and 2ND classification were selected, and their ACC values served as benchmarks for determining baseline log₁₀(LR) cutoff thresholds in a later section. For PO-UN classification, since the cumulative probability of exclusion (CPE) of this InDel panel had already been confirmed in this study and previous research^[4-9,20], we set “0.9999” as a benchmark for the subsequent methodological comparison.
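The evaluation ingredients named above (10-fold cross-validation, balanced ACC, and the 0.05 limit on the gap between training and test accuracy) can be illustrated without reference to any particular model. The following is a standard-library sketch under the usual definitions (balanced accuracy as the mean of per-class recalls); it is not the authors' implementation, and the helper names are hypothetical.

```python
import random


def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall, so every class counts equally regardless of size."""
    recalls = []
    for c in sorted(set(y_true)):
        idx = [i for i, y in enumerate(y_true) if y == c]
        recalls.append(sum(y_pred[i] == c for i in idx) / len(idx))
    return sum(recalls) / len(recalls)


def k_fold_indices(n, k=10, seed=0):
    """Yield (train_indices, test_indices) for k shuffled, non-overlapping folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    for i in range(k):
        train = [j for fold in folds[:i] + folds[i + 1:] for j in fold]
        yield train, folds[i]


def within_gap(train_acc, test_acc, max_gap=0.05):
    """True if the train/test accuracy difference stays within the allowed gap."""
    return abs(train_acc - test_acc) <= max_gap


print(balanced_accuracy(["FS", "FS", "UN", "UN"], ["FS", "UN", "UN", "UN"]))
print(within_gap(0.95, 0.91), within_gap(0.99, 0.90))
```

In practice each fold's training indices would be used to fit a candidate model and the test indices to score it; a configuration failing `within_gap` would be treated as overfitted and discarded or simplified.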

Due to the homology in the computational principles of IBD and IBS, the IBS+IBD feature combination was excluded. Multiclass models based on the remaining feature combinations were also developed using the same seven algorithms. These were converted to one-vs-rest classifiers, and their 10-fold cross-validation metrics were compared.

Methodological comparison of log₁₀(LR) cutoff threshold-based and ML methods for kinship classification

To establish baseline LR thresholds for different kinship hypotheses, we developed a Python pipeline (Supplementary Appendix B) implementing both single- and dual-threshold methods. The single-threshold approach compares the fundamental classification performance of the Bayesian and ML methods without considering confidence. The dual-threshold method incorporates confidence levels, treating results that fall between the two thresholds as “insufficient confidence”, to evaluate overall methodological differences.

For the single-threshold LR method, we iterated through all possible thresholds to identify the globally optimal threshold with the highest ACC. For the LR method with dual cutoff thresholds, the ACC value corresponding to the two thresholds had to be comparable to that of the best ML models. In the dual-threshold method, various thresholds were incorporated to provide a stable benchmark for comparison under different application scenarios and confidence levels. Considering that multiple pairs of thresholds might yield the same level of ACC, the baseline thresholds with the highest CR value were selected. Three sets of empirical thresholds (i.e., ±1, ±2, and ±3) were also included. After determining these thresholds, a multi-classification workflow based on standard forensic practice was applied to validation set pairs with unknown relatedness. Multi-class performance was evaluated using the macro F1-score, which balances recall and precision across all classes.
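Two of the steps described above, the exhaustive search for a single cutoff and the macro F1-score, can be sketched in plain Python. This is an illustration, not the pipeline in Supplementary Appendix B: the candidate cutoffs are assumed to be the observed log₁₀(LR) values themselves, ties are resolved in favour of the lowest cutoff, and how inconclusive calls enter the macro F1-score in the study is not specified here.

```python
def best_single_threshold(log10_lrs, truly_related):
    """Try every observed log10(LR) as a cutoff; return (cutoff, ACC) with the highest ACC."""
    # +inf covers the case where calling every pair unrelated is the best option.
    candidates = sorted(set(log10_lrs)) + [float("inf")]
    best_t, best_acc = None, -1.0
    for t in candidates:
        correct = sum((x >= t) == rel for x, rel in zip(log10_lrs, truly_related))
        acc = correct / len(log10_lrs)
        if acc > best_acc:  # strict '>' keeps the lowest cutoff among ties
            best_t, best_acc = t, acc
    return best_t, best_acc


def macro_f1(y_true, y_pred, labels):
    """Unweighted mean of per-class F1 = 2TP / (2TP + FP + FN)."""
    scores = []
    for c in labels:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy data with made-up values
threshold, acc = best_single_threshold(
    [-2.0, -1.0, 0.5, 1.5, 2.0], [False, False, False, True, True]
)
score = macro_f1(["PO", "FS", "UN", "UN"], ["PO", "UN", "UN", "UN"], ["PO", "FS", "UN"])
print(threshold, acc, score)
```

Because the macro average weights every class equally, a class that is never predicted correctly (FS in the toy data) pulls the score down sharply even when most pairs are classified correctly.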

RESULTS

Landscape of the basic forensic parameters for the 57 InDel loci in the CTG group

After Bonferroni correction [p_HWE (p values of Hardy-Weinberg equilibrium) > 8.772 × 10⁻⁴, p_LD (p values of linkage disequilibrium) > 3.133 × 10⁻⁵], the 57 InDel loci in the CTG group were consistent with Hardy-Weinberg equilibrium and performed relatively well in forensic individual identification [Supplementary Table 3]. No evidence of linkage disequilibrium was observed [Supplementary Table 4]. Table 4 lists the allele frequency data of the 57 InDel loci in unrelated East Asian individuals. As shown in Supplementary Figure 1, most allelic frequencies in the East Asians from the 1KGP dataset ranged between about 0.4 and 0.6, which meets the forensic requirements for individual identification and paternity testing in East Asian populations. Figure 3A shows the population genetic structure at different K values, with the CTG group highlighted centrally. The AGCU-60 panel identified three major ancestral components (African, East Asian, and European) across the 36 global populations, clustering CTG with two Han populations, Chinese Han in Guangdong (CHG) and Chinese Han in Hunan (CHH). A maximum likelihood phylogenetic tree [Figure 3B] placed all 15 East Asian populations on the same major branch, with CTG and Chinese Hani in Yunnan (CHY) sharing the most recent common ancestor. Normalized fit residuals near zero for most East Asian pairs supported the tree’s reliability. However, higher residuals between Dingjie Sherpa (SHP) and CTG/Japanese in Tokyo (JPT) suggest closer genetic ties than shown, possibly due to admixture. Overall, no significant substructure was detected among the East Asian populations.
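The two Bonferroni-adjusted significance thresholds quoted at the start of this subsection can be reproduced with a short calculation. The sketch assumes a nominal significance level of 0.05, which is not stated explicitly in the text but is consistent with the reported values: one Hardy-Weinberg test per locus and one linkage disequilibrium test per pair of loci.

```python
from math import comb

ALPHA = 0.05   # assumed nominal significance level
N_LOCI = 57    # autosomal InDel loci in the panel

hwe_threshold = ALPHA / N_LOCI          # one HWE test per locus
ld_threshold = ALPHA / comb(N_LOCI, 2)  # one LD test per pair of loci (1,596 pairs)

print(f"{hwe_threshold:.3e}")  # 8.772e-04
print(f"{ld_threshold:.3e}")   # 3.133e-05
```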

Figure 3. ADMIXTURE and Treemix analyses based on the 57 InDel loci genotyping data of individuals from different continental regions; (A) shows the ADMIXTURE analysis results when the K value is 2 to 7, and the CTG group is consistently displayed in the center; (B) is the maximum likelihood tree constructed by Treemix software, in which populations from the same intercontinental region are labeled with a specific color. A residual heatmap is also shown in the bottom left corner of the plot, and the color of each cell, as per the color scale legend, represents the normalized fit residual value between the pairwise populations. ACB: African Caribbean in Barbados; ASW: African Ancestry in Southwest US; ESN: Esan in Nigeria; GWD: Gambian in Western Division, The Gambia; LWK: Luhya in Webuye, Kenya; MSL: Mende in Sierra Leone; YRI: Yoruba in Ibadan, Nigeria; CLM: Colombian in Medellin, Colombia; MXL: Mexican Ancestry in Los Angeles, California; PEL: Peruvian in Lima, Peru; PUR: Puerto Rican in Puerto Rico; CDX: Chinese Dai in Xishuangbanna, China; CDXG: Chinese Dongxiang in Gansu, China; CHB: Han Chinese in Beijing, China; CHC: Chinese Han in Chengdu, China; CHG: Chinese Han in Guangdong, China; CHH: Chinese Han in Hunan, China; CHHN: Chinese Han in Hainan, China; CHS: Southern Han Chinese, China; CHY: Chinese Hani in Yunnan, China; CLH: Chinese Li in Hainan, China; CMY: Chinese Miao in Yunnan, China; CTG: Chinese Tibetan in Tibetan Autonomous Prefecture of Gannan, China; JPT: Japanese in Tokyo, Japan; KHV: Kinh in Ho Chi Minh City, Vietnam; SHP: Dingjie Sherpa, China; CEU: Utah residents with Northern and Western European ancestry; FIN: Finnish in Finland; GBR: British in England and Scotland; IBS: Iberian populations in Spain; TSI: Toscani in Italy; BEB: Bengali in Bangladesh; GIH: Gujarati Indian in Houston, TX; ITU: Indian Telugu in the UK; PJL: Punjabi in Lahore, Pakistan; STU: Sri Lankan Tamil in the UK.

Table 4

Frequency data of 57 InDels in unrelated East Asian individuals

Rs ID	Deletion	Insertion	Insertion frequency
rs3067397	G	GTATCT	0.4191
rs10607699	C	CCCT	0.6870
rs71852971	T	TACTC	0.4008
rs139764906	C	CCTAA	0.4985
rs67487831	T	TTCAA	0.6137
rs11277697	T	TTTAGG	0.4061
rs113011930	C	CTTCT	0.3886
rs144941014	T	TGAA	0.6198
rs146875868	C	CTCTT	0.4534
rs145191158	T	TTTTG	0.5473
rs66477007	T	TAAGA	0.5481
rs10590825	C	CCCT	0.5092
rs3834231	C	CTAGG	0.5229
rs145577149	C	CAAAT	0.3626
rs35309403	T	TACTG	0.6336
rs140820428	G	GAGA	0.3748
rs66595817	T	TCTTTC	0.5198
rs66879403	T	TTGA	0.4542
rs60867863	G	GATTA	0.4298
rs567292477	T	TTATAAC	0.4710
rs79225518	C	CAAG	0.4802
rs3217112	C	CTAATA	0.5511
rs67426579	A	AGTG	0.5069
rs151335218	C	CAAGT	0.3916
rs142221201	C	CAAAG	0.4198
rs66649248	G	GATC	0.6557
rs57981446	T	TAGGAG	0.5015
rs67365630	C	CACT	0.5565
rs35464887	C	CTTTA	0.5473
rs76158822	T	TTTAAG	0.4878
rs5897566	T	TTAAC	0.6427
rs67405073	C	CTGA	0.3870
rs140683187	G	GAAC	0.4206
rs5787309	A	ATTATT	0.4763
rs34287950	G	GTTT	0.3710
rs67100350	G	GTAGT	0.6137
rs769299	G	GTATC	0.6053
rs3076465	C	CTTAT	0.3328
rs67939200	C	CTCA	0.4076
rs145941537	C	CAATT	0.6382
rs67264216	A	ATGTCG	0.3595
rs35453727	G	GAGA	0.5405
rs35065898	A	AACTT	0.5832
rs561160795	C	CTGG	0.3779
rs61490765	C	CTTAAT	0.5580
rs34419736	T	TAAG	0.4817
rs77635204	G	GAGAA	0.4603
rs34421865	A	ACTCT	0.4198
rs66739142	C	CTCTTT	0.6588
rs557813049	G	GTGTGC	0.6389
rs145010051	T	TGGA	0.4824
rs72031009	C	CTAGAG	0.4664
rs77206391	C	CACAA	0.4626
rs33971783	C	CTGTT	0.5725
rs72085595	T	TTGTC	0.2519
rs34529638	A	ACCT	0.4298
rs538690481	G	GTCTGAA	0.4649

Genetic background exploration of the East Asian populations

Based on the subpopulation frequency data, the population-level PCA [Supplementary Figure 2A] showed that the 36 populations from the global dataset were roughly divided into three hierarchical clusters along the first two principal components (PCs), and no significant genetic substructure was found among the East Asians. Notably, although the SHP population was also distributed within the East Asian cluster, its average PC2 and PC3 levels were higher than those of other East Asian populations [Supplementary Figure 2B]. As shown in Supplementary Table 5, no significant genetic differentiation was observed among 29 East Asian populations, and the F_IS value indicated a low level of inbreeding among individuals within these populations. Supplementary Table 6 shows the pairwise Nei’s D_A values between the CTG group and the 35 reference populations based on the 57 InDel loci in the AGCU-60 panel. Together, these results indicate that the East Asian dataset could serve as the input dataset for the subsequent kinship classifications.

Comprehensive assessment of multiple forensic kinship parameters

We retained 655 individuals with no kinship within five degrees from the East Asian dataset of 3,200 individuals. Based on these individuals, the allele frequencies of the 57 InDels were calculated [Table 4]. Supplementary Table 7 shows the pairwise kinship coefficients calculated for these 655 individuals. Using these data, we performed pedigree simulations with Familias. For the 40,000 simulated pairs, different kinship parameters were calculated, including IBS (IBS₀-₂ and CIBS), IBD (k₀-₂ coefficients and proportion IBD), and LR values under the hypotheses of PO-UN, FS-UN and 2ND-UN. For the k coefficients, proportion IBD, and IBS₀, which had relatively discrete distributions, we applied data binning to transform them into histograms for better visualization [Figure 4A and B]. As shown in Figure 4A, most kinship parameters exhibited significant skewness in their distributions. Among the IBD-related parameters, FS pairs showed distributions that were closest to the theoretical values presented in Table 1, while the observed values for the UN pairs differed the most. k₀ could distinguish between the four types of relationship pairs [Supplementary Figure 3A], whereas k₁ primarily distinguished PO from UN pairs, k₂ mainly identified FS pairs, and the IBD proportions of 2ND and UN pairs each exhibited unique distributions. Most of the IBS-related parameters followed a normal distribution (except for the IBS₀ value). As shown in Figure 4B and Supplementary Figure 3B, IBS₀ could distinguish between PO and UN pairs; IBS₁ showed a unique distribution for FS pairs; IBS₂ and CIBS exhibited distribution differences among individual pairs with the four types of relatedness. The LR-related parameters showed a normal distribution and appeared as three clusters: (1) UN pairs, (2) 2ND pairs, and (3) PO and FS pairs [Figure 4C]. Overall, LR values for a specific kinship were generally highest under the corresponding hypothesis.

Figure 4. Density distribution maps of different kinship parameters. (A) presents the density distributions of IBD parameters (with bin width of 0.1), including k₀, k₁, k₂, and proportion IBD; (B) demonstrates the density distributions of IBS parameters, including IBS₀ (with bin width of 0.1), IBS₁, IBS₂, and normalized cumulative IBS; (C) displays the distributions of log₁₀(LR) values under different hypotheses. The vertical dashed lines represent the median values of the corresponding relationships. IBD: Identity-by-descent; IBS: identity-by-state; FS: full siblings; PO: parent-offspring; 2ND: 2nd-degree relatives; UN: unrelated individuals.

Figure 5A-C presents the t-SNE results for feature combinations based on a single category of kinship parameters. For the IBD, IBS and IBS+IBD features [Figure 5A, B and D], despite evident non-linear boundaries between different kinship pairs, there were no discernible structural patterns. In the dimensionality reduction based on LR values [Figure 5C], a majority of PO and FS pairs clustered distinctly from UN pairs, but part of the 2ND pairs was still mixed with UN, FS, and PO pairs. The remaining panels involve LR values in the feature combinations [Figure 5E-G]. In IBS+LR [Figure 5F], LR values and IBS values jointly influenced the data structure, reducing the mixing of UN with other kinship pairs to some extent. In the ALL combination [Figure 5G], the data structure was still dominated by LR parameters, and the mixing of UN and 2ND kinship pairs with other kinship pairs was further diminished. Although these visualizations provided insights into the distribution structure of the data, the kinship classification efficiencies of these feature combinations still had to be validated by developing fine-tuned models.

Figure 5. Visualizations of dimension reduction for different feature combinations based on the t-SNE algorithm. (A) and (B) are the t-SNE dimension reduction results for IBD parameters (k₀, k₁, k₂, and IBD proportions) and IBS parameters (IBS₀, IBS₁, IBS₂, and normalized cumulative IBS), respectively; (C) displays the results for log₁₀(LR) values under different hypotheses; (D-G) show the results for feature combinations of IBD+IBS, IBD+log₁₀(LR), IBS+log₁₀(LR) and all mentioned features, respectively. Different colored dots represent different types of kinship pairs. IBD: Identity-by-descent; IBS: identity-by-state; FS: full siblings; PO: parent-offspring; 2ND: 2nd-degree relatives; UN: unrelated individuals; t-SNE: t-distributed stochastic neighbor embedding.

Efficacy evaluations of the binary and multi-classification ML models for kinship classifications

Hyperparameter configurations and the corresponding model performance metrics can be found in Supplementary Table 8. For the FS-UN classification, the LGBM model performed best (ACC = 0.9902); for the 2ND-UN task, the XGB model performed best (ACC = 0.9194). Despite its good interpretability in forensic practice, MNB had the lowest performance in both binary classifications, with ACC values of 0.9474 (FS-UN) and 0.7960 (2ND-UN). In the FS-UN classification, performance differed little among models of different complexity. In the more challenging 2ND-UN classification, however, the ACCs of the three ensemble learning models (XGB, LGBM, and CATB) were all above 0.91, markedly higher than the average ACC of 0.84 for the MLR and SVM models.

Performance metrics and hyperparameter configurations for the fine-tuned models under different feature combinations are provided in Supplementary Table 9. Overall, the IBD, IBD+LR, and ALL feature combinations performed best, with ACC values all exceeding 0.92, whereas the optimal models under the other combinations had ACCs below 0.90. For further evaluation, we selected the IBD combination, which performed best among the combinations based on a single kind of kinship parameter, together with the feature combinations of multiple kinship parameters. Figure 6 shows the receiver operating characteristic (ROC) curves [Figure 6A-D] and learning curves [Figure 6E-H] for the optimal models under the IBD, IBS+LR, IBD+LR, and ALL feature combinations. The ALL combination performed best, followed by IBD+LR, IBD, and IBS+LR. The learning curves indicated that IBD had the best fit, followed by IBS+LR and ALL. For the IBD+LR combination, a gap of approximately 0.05 between the training and test scores in 10-fold cross-validation suggested that further increasing model complexity may lead to overfitting.

Figure 6. ROC curves and learning curves of the top four optimal models based on IBD, IBS+LR, IBD+LR, and all features. (A-D) show the ROC curves of the top four optimal models based on feature combinations of IBD, IBS+LR values, IBD+LR values, and all features, and AUC values of the one vs. rest classifiers for each class are listed in the bottom right corner; (E-H) display the learning curves of the optimal models based on feature combinations of IBD, IBS+LR values, IBD+LR values, and all features. IBD: Identity-by-descent; IBS: identity-by-state; LR: likelihood ratio; 2ND: 2nd-degree relatives; FS: full siblings; PO: parent-offspring; ROC: receiver operating characteristic; AUC: area under the curve; UN: unrelated individuals.

To enhance the interpretability of the kinship classification models, we selected the ALL-LGBM model for decision process visualization and feature importance analysis [Figure 7]. Figure 7A displays one of the decision trees of the model, in which samples were progressively classified across multiple nodes according to thresholds on different kinship parameters. Figure 7B illustrates the distribution of feature importance in the ALL-LGBM model. The LR parameters played a dominant role in kinship multi-classification, with LR_PO-UN being the most important. After the LR parameters, k₁ of IBD was the most important kinship parameter.

Figure 7. Decision tree visualization and feature importance of the best model (LGBM) based on all features in the modeling set. (A) shows one of the 620 decision tree estimators in the LGBM model developed on all features in the modeling set. In addition to the classification process itself, each node and leaf displays the distribution ratio of samples, along with the data feature and threshold on which the split is made; (B) displays the importance rank of each data feature based on the total gain of the splits that use this feature. LR_PO-UN, LR_FS-UN, and LR_2ND-UN refer to the LR values under the hypotheses of PO-UN, FS-UN, and 2ND-UN, respectively. LR: Likelihood ratio; PO: parent-offspring; FS: full siblings; 2ND: 2nd-degree relatives; UN: unrelated individuals.
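The ranking by total gain described in the caption can be illustrated with a short sketch. It only shows the aggregation step, under the assumption that each split of each tree has been recorded as a (feature, gain) pair; the function name `total_gain_importance` and the gain values are hypothetical and are not taken from the fitted model.

```python
from collections import defaultdict


def total_gain_importance(splits):
    """Sum the gain of every split per feature and rank features by that total."""
    totals = defaultdict(float)
    for feature, gain in splits:
        totals[feature] += gain
    # Highest total gain first.
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


# Hypothetical (feature, gain) records collected over all trees (assumption).
splits = [
    ("LR_PO-UN", 120.0),
    ("k1", 40.0),
    ("LR_PO-UN", 60.0),
    ("LR_FS-UN", 75.0),
    ("k1", 20.0),
    ("IBS0", 5.0),
]
ranking = total_gain_importance(splits)
for feature, gain in ranking:
    print(feature, gain)
```

A feature that is used in many low-gain splits can therefore still rank below a feature used in a few high-gain splits, which is one reason gain-based and count-based rankings may differ.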

Comparing the methodologies of cutoff threshold-based and ML classification

We used an exhaustive search with a step size of 0.01 to find the single cutoff thresholds with the highest ACCs: PO-UN (1.54), FS-UN (-0.03), and 2ND-UN (-0.01). Considering the calculated CPE values and the ACCs of the optimal models developed for the FS-UN and 2ND-UN binary classification tasks, the following log₁₀(LR) dual thresholds were obtained: PO-UN (0.58/3.11), FS-UN (-1.30/0.22), and 2ND-UN (-0.49/0.53). Common empirical thresholds, such as ±1, ±2, and ±3, were also applied for methodological comparison. The performance metrics corresponding to these thresholds in the modeling set are presented in Table 5. For FS-UN, the ML workflow based on LGBM achieved the highest accuracy (0.9902); the traditional workflow reached the same accuracy level with dual thresholds of -1.30 and 0.22. At this point, the threshold-based method had a CR of 0.9553 and an SP of 0.9459, both lower than the ML method’s values of 1.0 and 0.9902, respectively. Compared with the single-threshold method, binary kinship classification based on dual thresholds gained a marked increase in ACC, but at the cost of CR. This trade-off was particularly evident in the empirical threshold-based 2ND-UN classification. For 2ND-UN, the ML workflow based on XGB achieved the highest accuracy (0.9194); the common workflow reached the same accuracy level with dual thresholds of -0.49 and 0.53. Here, the common and ML workflows had CRs of 0.7066 and 1.0 and SPs of 0.6528 and 0.9194, respectively. Thus, despite its clear interpretability and numerous practical applications, the dual-threshold methodology led to a substantial decrease in CR and SP values.
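The exhaustive single-threshold search can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the function name `best_single_threshold`, the search range of -5 to 5, the use of plain (unweighted) accuracy, and the toy log₁₀(LR) values are all assumptions; only the step size of 0.01 comes from the text.

```python
def best_single_threshold(log_lr_related, log_lr_unrelated, lo=-5.0, hi=5.0, step=0.01):
    """Scan cutoffs on log10(LR); pairs at or above the cutoff are called related."""
    n_total = len(log_lr_related) + len(log_lr_unrelated)
    n_steps = int(round((hi - lo) / step))
    best_cutoff, best_acc = lo, -1.0
    for i in range(n_steps + 1):
        cutoff = round(lo + i * step, 2)
        true_pos = sum(x >= cutoff for x in log_lr_related)
        true_neg = sum(x < cutoff for x in log_lr_unrelated)
        acc = (true_pos + true_neg) / n_total
        if acc > best_acc:  # ties keep the lowest cutoff found first
            best_cutoff, best_acc = cutoff, acc
    return best_cutoff, best_acc


# Toy values only (assumption), not data from the study.
related = [0.4, 0.9, 1.3, 2.1, -0.2]
unrelated = [-1.5, -0.8, -0.3, 0.1, -2.0]
cutoff, acc = best_single_threshold(related, unrelated)
print(cutoff, acc)  # -0.29 0.9
```

Several cutoffs can reach the same accuracy, so the tie-breaking rule (here, the first cutoff encountered) is a design choice that would need to be fixed before reporting a single value.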

Table 5

Comparison of kinship binary classification methodologies based on log₁₀(LR) values in the modeling set

Kinship pair	Methodology	Threshold value	ACC	CR	SP
PO/UN	Single threshold¹	1.54	0.9952	1	0.9952
PO/UN	Dual thresholds	0.58/3.11	0.9999	0.9120	0.9119
FS/UN	Single threshold¹	-0.03	0.9811	1	0.9811
	Dual thresholds	-1.30/0.22	0.9902²	0.9553	0.9459
		±1	0.9950	0.9386	0.9339
		±2	0.9993	0.8324	0.8318
		±3	0.9999	0.6618	0.6617
	ML model (LGBM)	/	0.9902	1	0.9902
2ND/UN	Single threshold¹	-0.01	0.8386	1	0.8386
	Dual thresholds	-0.49/0.53	0.9194³	0.7066	0.6528
		±1	0.9593	0.4406	0.4227
		±2	0.9950	0.0865	0.0861
		±3	0.9999	0.0070	0.0070
	ML model (XGB)	/	0.9194	1	0.9194

¹ indicates the highest accuracy obtained with a single threshold in the PO, FS, and 2ND kinship classifications; ² and ³ indicate the accuracies of threshold-based methods that were close to those of the optimal ML binary models in the FS and 2ND kinship classification tasks, respectively. Metrics of the two optimal ML models for FS and 2ND kinship classification are also shown in the table. LR: Likelihood ratio; ACC: accuracy; CR: conclusiveness rate; PO: parent-offspring; FS: full siblings; 2ND: 2nd-degree relatives; ML: machine learning; UN: unrelated individuals; SP: system power.
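The relationship between ACC, CR, and SP under dual thresholds can be made concrete with a small sketch. It assumes that ACC is computed over conclusive pairs only, that CR is the share of pairs receiving a call, and that SP is the share of all pairs that are both conclusive and correct (so that SP = ACC × CR, which matches most rows of Table 5). These definitions, the function name `dual_threshold_metrics`, and the toy values are assumptions; the thresholds -0.49 and 0.53 are the 2ND-UN values from the text.

```python
def dual_threshold_metrics(log_lr_values, is_related, lower, upper):
    """Return (ACC, CR, SP) for a dual cutoff on log10(LR).

    Pairs above `upper` are called related, pairs below `lower` unrelated,
    and everything in between is left inconclusive.
    """
    conclusive = correct = 0
    for value, related in zip(log_lr_values, is_related):
        if lower <= value <= upper:
            continue  # gray zone: no call is made
        conclusive += 1
        called_related = value > upper
        correct += called_related == related
    total = len(log_lr_values)
    acc = correct / conclusive if conclusive else float("nan")
    cr = conclusive / total
    sp = correct / total
    return acc, cr, sp


# Toy values only (assumption), not data from the study.
values = [2.0, 0.3, -0.2, 1.5, -1.2, -0.7, 0.8, -2.5]
labels = [True, True, True, True, False, False, False, False]
acc, cr, sp = dual_threshold_metrics(values, labels, lower=-0.49, upper=0.53)
print(round(acc, 4), cr, sp)  # 0.8333 0.75 0.625
```

Widening the gray zone removes borderline pairs from the ACC denominator, which is why ACC rises while CR and SP fall for the ±1, ±2, and ±3 thresholds.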

Subsequently, we conducted kinship multiclassification in the validation set based on the baseline dual thresholds [Table 6]. Detailed metrics for the remaining ML models can be found in Supplementary Table 10. The performance metrics of the threshold-based methods were markedly lower than those of the ML methods. In the multiclassification without any assumption on relatedness, the single-threshold and dual-threshold methods had macro F1 values of 0.6955 and 0.5212 and CRs of 1.0 and 0.7046, respectively. Although the dual-threshold method had higher precision (0.8271) than the single-threshold method (0.7375), its recall (0.5740) and macro F1 value were lower because of the “gray zone”. Among the ML models, the IBD+LR-XGB model showed the best classification performance (F1 = 0.9020), and the ALL-XGB model ranked second (F1 = 0.9008). These two models also had relatively short fitting times (0.4615 s and 0.6484 s).

Table 6

Performance metrics of optimal classifiers based on different feature combinations for kinship multiple classification tasks in the validation set

Feature combination	Optimal classifier	Macro F1 score	Precision	Recall	CR	Fit time (s)
log₁₀(LR)	Single threshold	0.6955	0.7375	0.7069	1	-
	Dual thresholds	0.5212	0.8271	0.5740	0.7046	-
	XGB	0.7998	0.8005	0.8009	1	0.7981
IBS	LGBM	0.8003	0.8024	0.8029	1	0.0529
IBD	RFC	0.8982	0.8989	0.8991		3.5254
IBS+log₁₀(LR)	CATB	0.8204	0.8217	0.8223		3.4871
IBD+log₁₀(LR)	XGB	0.9020	0.9023	0.9023		0.4615
ALL	XGB	0.9008	0.9009	0.9011		0.6484

ACC is the balanced accuracy; IBS denotes the IBS-based data features IBS₀, IBS₁, IBS₂, and cumulative IBS (CIBS); IBD denotes the IBD-based data features k₀, k₁, k₂, and IBD proportions; ALL is the full combination of all features mentioned in this table. CR: conclusiveness rate; RFC: random forest classifier; XGB: eXtreme gradient boosting; LGBM: light gradient boosting machine; CATB: categorical boosting.
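How inconclusive calls can leave precision high while pulling recall and macro F1 down is illustrated by the sketch below. It assumes that macro scores are unweighted means of the per-class values and that an inconclusive pair (encoded here as `None`) counts as a miss for its true class but is not attributed to any predicted class. The function name `macro_scores` and the toy labels are hypothetical.

```python
def macro_scores(y_true, y_pred, classes):
    """Macro-averaged precision, recall and F1; `None` predictions are inconclusive."""
    precisions, recalls, f1s = [], [], []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        predicted = sum(p == c for p in y_pred)
        actual = sum(t == c for t in y_true)
        precision = tp / predicted if predicted else 0.0
        recall = tp / actual if actual else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)
    n = len(classes)
    return sum(precisions) / n, sum(recalls) / n, sum(f1s) / n


# Toy labels only (assumption), not data from the study.
classes = ["PO", "FS", "2ND", "UN"]
y_true = ["PO", "PO", "FS", "FS", "2ND", "2ND", "UN", "UN"]
y_pred = ["PO", "PO", "FS", None, "2ND", "UN", "UN", None]
precision, recall, f1 = macro_scores(y_true, y_pred, classes)
print(precision, recall, round(f1, 4))  # 0.875 0.625 0.7083
```

In this toy case the two inconclusive pairs lower recall for FS and UN without affecting precision, which mirrors the pattern reported for the dual-threshold method in Table 6.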

Figure 8 presents the normalized confusion matrices for the cutoff threshold methods and the ML models. The single-threshold method was suitable for the PO-UN classification task but struggled to differentiate between FS and 2ND pairs. Despite a moderate improvement with the dual cutoff thresholds, FS pairs still tended to be misclassified as PO. In the challenging task of classifying 2ND pairs with the InDel capillary electrophoresis panel, the IBS+LR-CATB and IBD-RFC models exhibited lower accuracies. In contrast, models that incorporated both IBD and LR kinship parameters, such as IBD+LR-XGB and ALL-XGB, showed marked improvements. Overall, incorporating various kinship parameters and ML models was considered beneficial for forensic kinship analysis based on the InDel capillary electrophoresis system, especially for multiclassification.

Figure 8. Normalized confusion matrices based on the thresholding method and four machine learning models with better performance under kinship multi-classification tasks in the modeling set. (A and B) are confusion matrices of single-threshold and dual-threshold methods; (C-F) show the confusion matrices of the top four optimal models based on IBD, IBS+LR, IBD+LR, and all features, respectively. IBD: Identity-by-descent; IBS: identity-by-state; LR: likelihood ratio; ML: machine learning; FS: full siblings; PO: parent-offspring; 2ND: 2nd-degree relatives; UN: unrelated individuals.
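A normalized confusion matrix of the kind shown in Figure 8 can be computed as sketched below. The sketch assumes normalization by true class (each row sums to 1), which is one common convention and may differ from the one used for the figure; the function name `normalized_confusion` and the toy labels are hypothetical.

```python
def normalized_confusion(y_true, y_pred, classes):
    """Row-normalized confusion matrix: rows are true classes, columns are predictions."""
    index = {c: i for i, c in enumerate(classes)}
    counts = [[0] * len(classes) for _ in classes]
    for t, p in zip(y_true, y_pred):
        counts[index[t]][index[p]] += 1
    matrix = []
    for row in counts:
        total = sum(row)
        matrix.append([v / total if total else 0.0 for v in row])
    return matrix


# Toy labels only (assumption), not data from the study.
classes = ["PO", "FS", "2ND", "UN"]
y_true = ["PO", "PO", "FS", "FS", "2ND", "2ND", "2ND", "2ND", "UN", "UN"]
y_pred = ["PO", "PO", "FS", "PO", "2ND", "2ND", "UN", "FS", "UN", "2ND"]
matrix = normalized_confusion(y_true, y_pred, classes)
for name, row in zip(classes, matrix):
    print(name, row)
```

With row normalization, the diagonal entry of each row is the recall of that class, so a low value in the 2ND row directly reflects 2ND pairs being assigned to other relationships.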

DISCUSSION

Overview of the forensic basic parameters and genetic background

In the CTG group, the 57 InDel loci panel was more capable of forensic individual identification and paternity testing in East Asian populations than other existing commercial systems^[26,27]. Population genetic investigations indicate that its performance in individual identification and paternity testing is relatively robust across the studied East Asian populations. As mentioned in Section 3.2, PCA revealed a latent population substructure between the SHP group and the other East Asian populations, forming two clusters in the PC1-PC3 and PC2-PC3 dimensions. However, the analysis of molecular variance did not reveal statistically significant inter-population genetic differences within these East Asian populations. Furthermore, Nei’s D_A distances among the studied East Asian populations did not support the latent population substructure either. These results suggest that the apparent, nonsignificant population substructure may not be attributable to genetic differences. One possible explanation is a batch effect: the 1KGP data were derived from whole-genome sequencing, whereas the 57 InDel loci data were generated on the CE platform. For the SHP population, the PCA results might also be influenced by the larger sample size (n = 628). In summary, the merged data of the 15 populations in the East Asian dataset represent a consistent genetic distribution, and it is acceptable to conduct kinship simulations based on this merged dataset. In the results section, we observed that, among the IBD-related parameters, the values for UN pairs showed the greatest variation. This may reflect one of the limitations of this study: the relatively small number of genetic markers led to an increased false positive rate in the MoM algorithm, which is typically applied to high-density markers. Additionally, among the 655 individuals used to simulate UN pairs in this study, approximately 0.82% (1761) of the resulting 214,185 kinship coefficient values fell within the range of 0-0.0625. This may have caused the simulated UN pairs to deviate from real-world scenarios.
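The pair count and percentage quoted above follow from simple combinatorics, as the short check below shows. It assumes that the 214,185 values correspond to all unordered pairs of the 655 individuals; the variable names are illustrative.

```python
from math import comb

n_individuals = 655
n_pairs = comb(n_individuals, 2)  # every unordered pair of distinct individuals
n_in_range = 1761                 # count reported in the text
share = n_in_range / n_pairs

print(n_pairs)         # 214185
print(f"{share:.2%}")  # 0.82%
```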

Forensic kinship classifications from the machine learning perspective

Although many forensic genomics studies combining ML algorithms were designed to take raw genotyping data as direct input^[28-30], models developed after data preprocessing have proven more cost-effective and accurate. Compared with existing dimensionality reduction methods, we believe that the kinship parameters IBD, IBS, and LR are more advisable, since these parameters are designed according to Mendel’s laws. ML models developed on these parameters can be more interpretable and can be compared directly with the LR cutoff threshold-based method. Therefore, based on the merged East Asian dataset, we constructed a series of binary kinship classification models on LR values and then trained all-in-one multi-classification models using 6 different feature combinations. As discussed in Section 3.3, the distribution characteristics of the various kinship parameters might affect forensic kinship analysis and ML model construction. For example, assuming no Mendelian errors, parameters such as IBS₀ and k₀ could differentiate PO from other relatedness pairs. k₂ had a marked impact on classifying FS pairs, which LR values alone cannot achieve under certain hypotheses. Visualization of multiple kinship parameters showed that the mixing between different relatedness pairs gradually decreased as different types of kinship parameters were introduced. Based on these results, we believe that combining multiple parameters can increase the information available about the relationship between two individuals, which provides a theoretical foundation for subsequent ML model construction based on various feature combinations.

Methodology comparison in the FS-UN binary classification in the modeling set showed that the ML method had no evident advantage over the single-threshold method, with a difference in SP values of less than 0.01. For example, there were only minor performance differences between the mathematical model MLR and the optimal model LGBM. This can be ascribed to the rather small scale and dimensionality of the input dataset (< 50 k objects). Even though this task can be properly tackled by mathematical models such as MLR and SVM, according to the principle of Occam’s razor, the simpler cutoff threshold method should ultimately be applied in such cases. However, because the studied InDel loci are bi-allelic, the classification of 2ND kinship is more difficult. In this scenario, ML algorithms, especially ensemble algorithms, effectively improved the SP values in the classification of 2ND pairs, including HS, AC, and GP relatedness. By comparing the cutoff threshold-based method and the ML models based on log₁₀(LR_2ND-UN) values in the 2ND kinship binary classification, we found that the ML models achieved equivalent ACCs without sacrificing CR values, and the XGB model generally performed best. However, according to the No Free Lunch Theorem^[31], no single model fits all kinds of data and prediction tasks; there is only an optimal model for a particular dataset. In the subsequent multi-kinship analysis based on different feature combinations, the most suitable models were LR-XGB, IBS-XGB, IBD-RFC, IBS+LR-XGB, IBD+LR-LGBM, and ALL-LGBM, respectively. At the same time, increasing the feature dimensions could improve the efficiency of ML models in distinguishing individual pairs with different kinship, especially 2ND and UN pairs. However, this improvement came at the cost of a higher model variance, close to 0.05, which might be caused by the increased model complexity. Moreover, the fitting speed of SVM and RFC decreased as the number of dataset features increased. The CATB model was the slowest-fitting of all ensemble models; its speed depends largely on the structure of the dataset, making its training time markedly longer on the small dataset^[32]. In general, under the 6 feature combinations, the ensemble learning algorithms RFC, XGB, and LGBM consistently showed promising classification ability and fast fitting speeds. These ensemble learning algorithms, especially boosting algorithms, have greater application potential in forensic kinship multi-classification.

In this study, the common kinship analysis workflow refers to calculating LR values under all kinship hypotheses for two individuals of unknown relatedness. The kinship hypothesis is then selected according to the maximum LR value, and predefined thresholds are used to validate the final conclusion. However, this method has the following limitations: (1) An essential step of the common workflow is estimating the most likely relationship from the maximum LR value. Although research based on sequencing data suggested that the LR values for specific kinship pairs were generally highest under their respective hypotheses^[15], our study based on 57 InDel loci did not consistently support this, which introduces a risk of misjudgment for kinship identification with InDel-CE panels of lower polymorphism; (2) Individuals between the dual thresholds are considered “undetermined”. Their potential relatedness information is typically ignored in routine practice, leaving a risk of false negative results; (3) Without an a priori range for the relatedness between individual pairs, kinship analysis requires calculating LR values under every possible kinship hypothesis. Although this is acceptable for case analyses, the computational load and time complexity may increase substantially in real-time parallel comparisons within large forensic genotyping databases. Fortunately, by introducing multiple feature combinations and ML algorithms, these limitations can be preliminarily addressed. As shown in Section 3.3.1, the mutual confirmation of various kinship parameters enabled the ML classifiers to learn patterns between different kinship pairs from a more comprehensive perspective, resulting in higher ACC values. A further advantage of ML classifiers is that kinship multi-classification can be performed without inconclusive results, leading to a higher CR value. Moreover, the developed ML method does not require an exhaustive search over all potential kinships, allowing fast preliminary family or pedigree screening in an existing database.
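The common workflow described above can be summarized in a short sketch. It is a simplified reading of the text, not the authors' code: the function name `classify_by_max_lr`, the rule that a value below the lower threshold yields an "UN" call, and the toy log₁₀(LR) values are assumptions; the dual thresholds are those reported for the modeling set.

```python
def classify_by_max_lr(log_lr_by_hypothesis, dual_thresholds):
    """Pick the hypothesis with the largest log10(LR), then check it against its cutoffs.

    Returns the relationship label, "UN" for unrelated, or None for the gray zone.
    """
    best = max(log_lr_by_hypothesis, key=log_lr_by_hypothesis.get)
    value = log_lr_by_hypothesis[best]
    lower, upper = dual_thresholds[best]
    if value > upper:
        return best
    if value < lower:
        return "UN"  # assumption: below the lower cutoff is read as unrelated
    return None      # undetermined


# Dual thresholds (lower, upper) on log10(LR) reported in the text.
thresholds = {"PO": (0.58, 3.11), "FS": (-1.30, 0.22), "2ND": (-0.49, 0.53)}

# Toy log10(LR) values for three pairs (assumption).
print(classify_by_max_lr({"PO": -3.0, "FS": 1.4, "2ND": 0.9}, thresholds))   # FS
print(classify_by_max_lr({"PO": -4.0, "FS": -0.5, "2ND": 0.1}, thresholds))  # None
print(classify_by_max_lr({"PO": -6.0, "FS": -2.0, "2ND": -1.1}, thresholds)) # UN
```

The sketch makes limitations (1) to (3) visible: the call depends entirely on which hypothesis has the maximum LR, pairs in the gray zone receive no label, and an LR must be computed under every hypothesis before any decision is made.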

With relatively short amplification fragments, InDel-CE panels are suitable for the detection of degraded biomaterials and are easier to popularize in front-line forensic practice. In the future, supplementary ML models developed on InDel-CE panels can effectively improve the ACC and SP of 2ND kinship identification. However, it is essential to emphasize that, in this study, the main purpose of the ML models is to improve CR values, facilitating their deployment in large-scale preliminary screening of relatedness within forensic genotyping databases. In the context of forensic practice, there is still a need to develop a more comprehensive system for evidence interpretation. Unlike the common workflow, the ML classifiers used in this study did not allow ACC to be improved by sacrificing CR, so their classification accuracy fell short of the forensic practice requirement of 0.9999 and their strength of evidence was weaker. We believe that incorporating classification probabilities might be a direction worth exploring. By setting a decision threshold on the classification probability for each type of relatedness, the accuracy might be increased accordingly. However, this classification probability is determined by the input dataset and will not faithfully reflect the real-world situation. Large-scale validation and more real-world pedigree samples are still necessary to establish relatively reliable prior probabilities and further validate the ML methodology.
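The proposed use of classification probabilities can be sketched as follows. This is an exploratory illustration of the idea only: the function names, the single shared cutoff, and the toy probabilities are assumptions, and no such analysis is reported in this study.

```python
def classify_with_probability_cutoff(class_probabilities, cutoff):
    """Return the most probable class, or None when its probability is below `cutoff`."""
    label = max(class_probabilities, key=class_probabilities.get)
    return label if class_probabilities[label] >= cutoff else None


def accuracy_and_cr(predictions, truths):
    """ACC over conclusive calls only, and CR as the share of conclusive calls."""
    conclusive = [(p, t) for p, t in zip(predictions, truths) if p is not None]
    cr = len(conclusive) / len(truths)
    acc = sum(p == t for p, t in conclusive) / len(conclusive) if conclusive else float("nan")
    return acc, cr


# Toy class probabilities for four pairs (assumption), with their true labels.
probabilities = [
    {"PO": 0.97, "FS": 0.02, "2ND": 0.01, "UN": 0.00},
    {"PO": 0.05, "FS": 0.90, "2ND": 0.04, "UN": 0.01},
    {"PO": 0.00, "FS": 0.05, "2ND": 0.55, "UN": 0.40},
    {"PO": 0.00, "FS": 0.02, "2ND": 0.18, "UN": 0.80},
]
truths = ["PO", "FS", "UN", "UN"]

for cutoff in (0.0, 0.7):
    calls = [classify_with_probability_cutoff(p, cutoff) for p in probabilities]
    print(cutoff, accuracy_and_cr(calls, truths))
# 0.0 (0.75, 1.0)
# 0.7 (1.0, 0.75)
```

Raising the cutoff reintroduces the trade-off seen with dual LR thresholds: the uncertain 2ND/UN pair becomes inconclusive, ACC rises, and CR falls. Per-class cutoffs, as suggested in the text, would replace the single `cutoff` argument with a mapping from class to cutoff.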

DECLARATIONS

Acknowledgments

We greatly appreciate the contributions of forensic researchers to expanding the AGCU-60 population database, and we gratefully thank the participants in this study.

Authors’ contributions

Conceptualization: Lei F, Wu X, Zhu B

Data curation, formal analysis, project administration, validation: Lei F, Liu Q

Funding acquisition, supervision: Xie T, Zhu B

Investigation: Liu Q, Wu X

Methodology: Lei F, Liu Q, Xie T

Resources: Zhu B

Software: Lei F, Wu X

Visualization, writing - original draft: Lei F

Writing - review & editing: Wu X, Xie T, Zhu B

Availability of data and materials

The raw data for this article are available upon reasonable request to the corresponding authors with the permission of the authorities.

AI and AI-assisted tools statement

Not applicable.

Financial support and sponsorship

This study was funded by the National Natural Science Foundation of China (Nos. 82293650, 82293652, and 82572152) and the National Key R&D Program of China (2022YFC3302004-1).

Conflicts of interest

Zhu B is a Section Editor of Journal of Translational Genetics and Genomics. Zhu B is also the Guest Editor of the Special Issue entitled “Topic: Molecular Innovation in Forensic Genetics” in the Journal of Translational Genetics and Genomics. Zhu B was not involved in any steps of the editorial process, notably including reviewers’ selection, manuscript handling, or decision-making. The other authors declare that there are no conflicts of interest.

Ethical approval and consent to participate

The present study strictly adhered to the ethical guidelines of the Declaration of Helsinki and was approved by the Ethics Committee of Xi’an Jiaotong University Health Science Center and Southern Medical University (No. 2019-1039). All subjects voluntarily signed the informed consent form.

Consent for publication

Not applicable.

REFERENCES

1. Pereira R, Phillips C, Alves C, Amorim A, Carracedo A, Gusmão L. A new multiplex for human identification using insertion/deletion polymorphisms. Electrophoresis. 2009;30:3682-90.

2. Manta F, Caiafa A, Pereira R, et al. Indel markers: genetic diversity of 38 polymorphisms in Brazilian populations and application in a paternity investigation with post mortem material. Forensic Sci Int Genet. 2012;6:658-61.

3. Zhang YD, Shen CM, Jin R, et al. Forensic evaluation and population genetic study of 30 insertion/deletion polymorphisms in a Chinese Yi group. Electrophoresis. 2015;36:1196-201.

4. Fan H, He Y, Li S, et al. Systematic evaluation of a novel 6-dye direct and multiplex PCR-CE-based InDel typing system for forensic purposes. Front Genet. 2021;12:744645.

5. Liu J, Du W, Jiang L, et al. Development and validation of a forensic multiplex InDel assay: the AGCU InDel 60 kit. Electrophoresis. 2022;43:1871-81.

6. Chen X, Nie S, Hu L, et al. Forensic efficacy evaluation and genetic structure exploration of the Yunnan Miao group by a multiplex InDel panel. Electrophoresis. 2022;43:1765-73.

7. Fang Y, Zhao C, Jin X, et al. Genetic characterization evaluation of a novel multiple system containing 57 deletion/insertion polymorphic loci with short amplicons in Hunan Han population and its intercontinental populations analyses. Gene. 2022;809:146006.

8. Chen M, Cui W, Bai X, et al. Comprehensive evaluations of individual discrimination, kinship analysis, genetic relationship exploration and biogeographic origin prediction in Chinese Dongxiang group by a 60-plex DIP panel. Hereditas. 2023;160:14.

9. Xu H, Nie S, Hu L, et al. Comprehensive understanding the forensic systematic effectiveness in Chinese Yunnan Hani group and intercontinental population Architecture differentiation analyses via a novel set of autosomal InDel markers. Front Biosci. 2023;28:5.

10. Chakraborty R, Jin L. Determination of relatedness between individuals using DNA fingerprinting. Hum Biol. 1993;65:875-95.

11. Weir BS, Anderson AD, Hepler AB. Genetic relatedness analysis: modern data and new challenges. Nat Rev Genet. 2006;7:771-80.

12. Kling D, Tillmar A. Forensic genealogy-A comparison of methods to infer distant relationships based on dense SNP data. Forensic Sci Int Genet. 2019;42:113-24.

13. Coble MD, Buckleton J, Butler JM, et al. DNA Commission of the International Society for Forensic Genetics: recommendations on the validation of software programs performing biostatistical calculations for forensic genetics applications. Forensic Sci Int Genet. 2016;25:191-7.

14. Heinrich V, Kamphans T, Mundlos S, Robinson PN, Krawitz PM. A likelihood ratio-based method to predict exact pedigrees for complex families from next-generation sequencing data. Bioinformatics. 2017;33:72-8.

15. Galván-Femenía I, Barceló-Vidal C, Sumoy L, Moreno V, de Cid R, Graffelman J. A likelihood ratio approach for identifying three-quarter siblings in genetic databases. Heredity. 2021;126:537-47.

16. Xu Q, Wang Z, Kong Q, et al. Improving the system power of complex kinship analysis by combining multiple systems. Forensic Sci Int Genet. 2022;60:102741.

17. Chen T, Guestrin C. XGBoost: A scalable tree boosting system. In: KDD '16: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining; Association for Computing Machinery, New York, NY, USA, 2016: pp. 785-94.

18. Ke G, Meng Q, Finley T, et al. LightGBM: a highly efficient gradient boosting decision tree. In: Proceedings of the 31st International Conference on Neural Information Processing Systems. Red Hook, NY, USA: Curran Associates Inc.; 2017. p. 3149-57. Available from https://hal.science/hal-03953007/ [accessed 30 March 2026].

19. Byrska-Bishop M, Evani US, Zhao X, et al.; Human Genome Structural Variation Consortium. High-coverage whole-genome sequencing of the expanded 1000 genomes project cohort including 602 trios. Cell. 2022;185:3426-3440.e19.

20. Wang M, Du W, Tang R, et al. Genomic history and forensic characteristics of Sherpa highlanders on the Tibetan Plateau inferred from high-resolution InDel panel and genome-wide SNPs. Forensic Sci Int Genet. 2022;56:102633.

21. Gouy A, Zieger M. STRAF-A convenient online tool for STR data evaluation in forensic genetics. Forensic Sci Int Genet. 2017;30:148-51.

22. Excoffier L, Lischer HE. Arlequin suite ver 3.5: a new series of programs to perform population genetics analyses under Linux and Windows. Mol Ecol Resour. 2010;10:564-7.

23. Kling D, Tillmar AO, Egeland T. Familias 3 - extensions and new functionality. Forensic Sci Int Genet. 2014;13:121-7.

24. Geman S, Bienenstock E, Doursat R. Neural networks and the bias/variance dilemma. Neural Comput. 1992;4:1-58.

25. Neal B. On the bias-variance tradeoff: textbooks need an update. arXiv. 2019;arXiv:1912.08286.

26. LaRue BL, Ge J, King JL, Budowle B. A validation study of the Qiagen Investigator DIPplex® kit; an INDEL-based assay for human identification. Int J Legal Med. 2012;126:533-40.

27. Pereira R, Gusmão L. Capillary electrophoresis of 38 noncoding biallelic mini-Indels for degraded samples and as complementary tool in paternity testing. Methods Mol Biol. 2012;830:141-57.

28. Alladio E, Poggiali B, Cosenza G, Pilli E. Multivariate statistical approach and machine learning for the evaluation of biogeographical ancestry inference in the forensic field. Sci Rep. 2022;12:8974.

29. Sun K, Yao Y, Yun L, et al. Application of machine learning for ancestry inference using multi-InDel markers. Forensic Sci Int Genet. 2022;59:102702.

30. Pilli E, Morelli S, Poggiali B, Alladio E. Biogeographical ancestry, variable selection, and PLS-DA method: a new panel to assess ancestry in forensic samples via MPS technology. Forensic Sci Int Genet. 2023;62:102806.

31. Wolpert D, Macready W. No free lunch theorems for optimization. IEEE Trans Evol Comput. 1997;1:67-82.

32. Prokhorenkova L, Gusev G, Vorobev A, Dorogush AV, Gulin A. CatBoost: unbiased boosting with categorical features. In: Proceedings of the 32nd International Conference on Neural Information Processing Systems. Red Hook, NY, USA: Curran Associates Inc.; 2018. p. 6639-49. Available from https://proceedings.neurips.cc/paper_files/paper/2018/file/14491b756b3a51daac41c24863285549-Paper.pdf [accessed 30 March 2026].

About This Article

Special Topic

This article belongs to the Special Topic Topic: Molecular Innovation in Forensic Genetics

Disclaimer/Publisher’s Note: All statements, opinions, and data contained in this publication are solely those of the individual author(s) and contributor(s) and do not necessarily reflect those of OAE and/or the editor(s). OAE and/or the editor(s) disclaim any responsibility for harm to persons or property resulting from the use of any ideas, methods, instructions, or products mentioned in the content.

Copyright

© The Author(s) 2026. Open Access This article is licensed under a Creative Commons Attribution 4.0 International License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, sharing, adaptation, distribution and reproduction in any medium or format, for any purpose, even commercially, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.

Methodological comparison of log₁₀(LR) cutoff threshold-based and ML methods for kinship classification