INTRODUCTION
The integration of artificial intelligence (AI) into spine surgery has given rise to significant improvements in patient safety, peri-operative decision making, and clinical outcomes[1]. As new technological innovations herald faster, more efficient, and more accurate AI models, it is imperative for surgeons to understand the impact of AI on current treatment paradigms and where spine surgeons’ focus should lie as we assist in the development of AI-enabled personalized and precision medicine.
At the cornerstone of clinical advancement with AI are machine learning (ML) models, capable of identifying and extracting patterns from large datasets and making predictions based on learned trends. As the availability of data grows, ML model performance continues to improve; therefore, the advancement of AI in medicine is uniquely tied to our ability to provide these models with accurate and pertinent datapoints. In this perspective, we provide a brief historical outline of current ML and AI applications in spine surgery. We then offer our thoughts on where the future of AI and spine surgery lies, and how the unique relationship between model accuracy and data volume will shape the future of how AI is implemented in clinical contexts.
CURRENT AI APPLICATIONS IN SPINE SURGERY
One of the earliest and most compelling uses of ML in spine surgery has been the automated interpretation of radiographic images. For example, ML-based classification of lumbar disc degeneration from 2-dimensional magnetic resonance imaging (MRI) has now reached levels comparable to expert radiologists[1-3]. The morphology of the discs is first described according to their pathological features and classified according to the standardized grading system proposed by Pfirrmann et al.[4]. A convolutional neural network (CNN) is then trained to extract image features from the training dataset and to predict the grades assigned in the radiologists’ interpretations. CNNs, a specialized subtype of deep learning (DL) algorithms, parallel the architecture of human visual cortex processing; in this setting, they learn to classify images through supervised training on the radiologist-labeled examples. CNN-based models for image classification are typically validated through k-fold cross-validation on the training data and then tested on independent, external datasets to ensure generalizability. Other groups have also explored the use of generative models to create image-to-image translations of the musculoskeletal system[5,6]. Clinically, this can provide a means to correct poor image resolution or blurriness due to patient motion during image acquisition.
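As an illustration of the k-fold cross-validation step mentioned above, the following minimal Python sketch partitions a dataset into k folds so that every sample is used for validation exactly once. The function names (k_fold_indices, cross_validate) and the fit_and_score callback are hypothetical and are not taken from any of the cited studies; an external test set would be kept entirely outside this loop.

```python
import random


def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_idx, val_idx) index lists for k-fold cross-validation."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    indices = list(range(n_samples))
    # Shuffle once with a fixed seed so the partition is reproducible.
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds of near-equal size.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        val_idx = folds[i]
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, val_idx


def cross_validate(n_samples, k, fit_and_score, seed=0):
    """Average a user-supplied score over the k folds.

    fit_and_score(train_idx, val_idx) is a hypothetical callback that trains
    a model on the training indices and returns its score on the validation
    indices.
    """
    scores = [fit_and_score(train_idx, val_idx)
              for train_idx, val_idx in k_fold_indices(n_samples, k, seed)]
    return sum(scores) / len(scores)
```

In this sketch, the averaged score estimates internal performance only; generalizability would still need to be checked on an independent dataset that never enters the folds.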
As DL algorithms have become more prevalent, they have gradually been implemented to automatically identify spinal landmarks and calculate deformity parameters. DL models are trained on large datasets to identify and classify complex phenomena through non-linear analysis in artificial neurons, similar in structure to the mammalian brain[7]. The automated analysis of the Cobb angle to describe the severity of scoliotic curvature has been addressed through several DL techniques[8-10]. Korez et al. also used DL to identify anatomical landmarks in X-ray images and measure spinopelvic parameters, finding no difference between DL and manual identification[11].
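Once endplate landmarks have been located, the Cobb angle itself reduces to simple geometry. The Python sketch below is a minimal example under stated assumptions: each endplate is given as two (x, y) image points ordered left to right, and the angle is taken as the absolute difference between the tilts of the two endplate lines. The function names are hypothetical, and the landmark detection step performed by the DL models in the cited studies is not reproduced here.

```python
import math


def endplate_tilt_deg(left, right):
    """Tilt of an endplate line in degrees, from two (x, y) landmarks.

    Assumes the landmarks are ordered left to right in the image.
    """
    dx = right[0] - left[0]
    dy = right[1] - left[1]
    return math.degrees(math.atan2(dy, dx))


def cobb_angle_deg(upper_endplate, lower_endplate):
    """Angle between the upper and lower end-vertebra endplate lines.

    Each endplate is a pair of (x, y) points: (left_landmark, right_landmark).
    """
    tilt_upper = endplate_tilt_deg(*upper_endplate)
    tilt_lower = endplate_tilt_deg(*lower_endplate)
    return abs(tilt_upper - tilt_lower)
```

For example, an upper endplate tilted by +10 degrees and a lower endplate tilted by -10 degrees would give a Cobb angle of 20 degrees under these assumptions.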
The transformative capability of AI can expedite diagnosis and treatment planning, and has the potential to standardize surgical treatment strategies for various spinal pathologies after taking patient-specific factors into account. Widespread implementation, however, faces substantial ethical challenges, as the prospect of removing human interpretation may increase patient distrust in the resulting conclusions. It is unlikely, then, that human radiologists will be replaced by AI technology; instead, their diagnostic accuracy will be improved as models continue to advance.
The advent of AI-powered predictive modeling also holds immense promise in the realm of personalized precision medicine. By assimilating vast repositories of patient data, including demographic information, comorbidities, and procedural specifics, AI algorithms can generate prognostic models tailored to individual patients, ushering in a new era where therapeutic decisions are guided by each patient’s unique physiology. This is particularly important for patient risk stratification, where clinical variables can be used as inputs (predictors) for the potential of operative complications. Pellisé et al. trained a random forest algorithm with clinical variables from 1,612 patients with adult spinal deformity (ASD) and identified age, surgical invasiveness, and deformity magnitude as potential risk factors for major complications[12]. Predictive models, such as random forest algorithms for complication risk stratification, undergo internal validation through cross-validation and are, at times, externally validated using datasets from different clinical settings to evaluate model transferability. In the study by Pellisé et al., internal validation was performed with an 80%/20% split between training/testing groups, measuring model performance through the observed area under the receiver operating characteristic curve (AUC) and the Brier score[12]. Ames et al. augmented this approach by applying unsupervised hierarchical clustering to classify ASD based on patient demographics and radiographic measurements with the goal of constructing a risk-benefit grid as a preoperative tool for decision making[13].
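The two performance measures named above can be stated compactly. The following Python sketch is illustrative only and is not the code used by Pellisé et al.: it shows a simple hold-out split, the Brier score (mean squared difference between predicted probability and observed outcome), and the AUC computed as the proportion of complication/no-complication pairs that the model ranks correctly, with ties counted as one half. All function names are hypothetical.

```python
import random


def holdout_split(n_samples, test_fraction=0.2, seed=0):
    """Return (train_idx, test_idx) for a single random hold-out split."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_test = round(n_samples * test_fraction)
    return indices[n_test:], indices[:n_test]


def brier_score(y_true, y_prob):
    """Mean squared error between predicted probabilities and 0/1 outcomes."""
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)


def roc_auc(y_true, y_prob):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [p for y, p in zip(y_true, y_prob) if y == 1]
    neg = [p for y, p in zip(y_true, y_prob) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    # A tie between a positive and a negative counts as half a correct pair.
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

A lower Brier score indicates better calibrated predictions, whereas a higher AUC indicates better discrimination between patients with and without the outcome; the two are complementary and are therefore often reported together.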
Current work continues to build upon existing outcomes prediction and postoperative prognostication. ML has been implemented to assess the likelihood of surgical site infection, major intra-operative complications, hospital length of stay, or the necessity of blood transfusion after surgery[14-17]. This has led to the development of universal prediction models trained retrospectively on large patient registries, such as the American College of Surgeons National Surgical Quality Improvement Program (ACS-NSQIP) database. The ACS-NSQIP developed an online calculator for morbidity and mortality risk, but reports demonstrated poor predictive performance[18]. Other groups have used the available ACS-NSQIP patient data as a resource to train their own models, with early indications of clinical efficacy at predicting outcomes[19,20]. Fully unsupervised models have considerable potential to revolutionize personalized care and stratify risk; however, deploying under-validated AI tools can lead to inaccurate diagnoses or inappropriate treatment recommendations, so caution is needed to ensure patient safety.
Lastly, an emerging implementation of ML and AI has been in the realm of outcomes assessment. Traditionally, evaluation of surgical outcomes relies on physician interpretation of radiographic imaging combined with patient questionnaires assessing changes in patient mobility, pain, and quality of life. These patient-reported outcome measures (PROMs) offer valuable insight into patients’ own interpretation of their health status and physical function. However, these methods are inherently subjective and often lack the precision and reliability needed for actionable insights[21,22]. More recently, there has also been a trend toward utilizing digital biomarkers and data-driven outcomes measurements in conjunction with traditional PROMs. Objective measurements of patient mobility obtained from patient smartphones, smartwatches, or other biometric wearables can provide additional unbiased insight into patient function[23-26]. The quantitative and continuous features of these data are well suited for integration with data-driven statistical and ML techniques, and they have enabled surgeons to better quantify changes in patient mobility after surgery and to predict which patients may be better suited to recover from a particular pathology[24,25].
FUTURE DIRECTIONS
The use of accelerometer and GPS information is a relatively novel concept, and more complex ML predictive models have yet to be applied. The incorporation of such models could significantly improve the accuracy of patient assessments by providing real-time, continuous data that captures a patient’s functional mobility in their everyday life. This can lead to a more detailed understanding of a patient’s functional baseline status and postoperative recovery, resulting in more personalized care. While many analyses of mobility data have been retrospective in nature, as adequately sized datasets become available, predictive models may be able to identify subtle mobility changes that signal complications or improvements earlier than would be possible with traditional assessments.
Further, advanced mobility metrics can add potential value for patient prognostication. As previously mentioned, groups are beginning to engineer universal prognostic models for outcome prediction trained on large data registries[19,20]. Although still in their infancy, accurate prognostic models could transform patient management by offering more realistic recovery trajectories, customizing patient care, or identifying patients at high risk for adverse outcomes. Challenges still limit the widespread implementation of such models, ranging from access to generalizable datasets and cost-effectiveness for stable implementation to ethical concerns.
Mobility metrics are not the only AI application that is challenged with limited data availability. Access to high-quality, standardized data sets is one of the greatest challenges to overall AI and ML implementation, especially within spine surgery, given the varied and nuanced model inputs spanning complex patient presentations, operative courses, and radiographic imaging. To address this challenge, there is a growing movement toward the creation of standardized, multi-center datasets that include patients from several geographic areas and socioeconomic groups. Other groups such as the ACS are refining their existing patient registries to integrate additional data from the electronic health record. Together, these datasets and registries aim to provide a foundation for training more accurate and generalizable AI models that can be deployed across various clinical settings.
Patient selection is another area of current clinical practice that stands to benefit from future AI and ML integration. The art of understanding which patients will benefit from certain procedures is not easily replicated with frameworks and rules that can be directly input into computerized programs. However, as CNNs and ML algorithms continue to grow in computational ability, they can potentially identify relationships between datapoints that are otherwise unnoticeable to the unaided human mind. In this way, future AI and ML models can augment surgeons’ clinical practice and help identify the characteristics of patients who are likely to benefit from specific surgical interventions.
While AI technologies like predictive modeling and image analysis hold promise in decision making, their potential intra-operative impact is already apparent[1,7]. AI-assisted intra-operative tools, such as robotics, navigation systems, and mixed reality, have the potential to significantly enhance the surgeon’s ability to execute procedures with high precision, particularly in minimally invasive and percutaneous surgeries. These technologies allow for real-time guidance and adjustment during complex procedures, reducing the margin of error. However, while AI can minimize the risk of intra-operative errors, it cannot fully replace the human element of adaptability and judgment. Surgeons must remain vigilant in managing unforeseen intra-operative variables and complications, as AI systems, though highly advanced, still require human oversight to ensure patient safety and the proper handling of unexpected challenges.
Although surgeon experience is regarded as a significant factor in decision making, there have been attempts to apply mathematical and data-driven approaches to surgical decision making[27]. Lewandrowski et al. recently used the Rasch model to determine the choice of procedure for endoscopic lumbar decompression[27]. The Rasch model is a logistic function analyzing categorical data, such as questionnaire responses, to find the relative difficulty of a task, and it has been widely established in education, marketing, and health economics[28]. However, it was found that there was still disagreement among surgeons regarding the ability to achieve adequate clinical outcomes, indicating that increased granularity through additional metrics is needed to overcome the disordered responses[27].
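For readers unfamiliar with the Rasch model, its simplest (dichotomous) form expresses the probability of a positive response as a logistic function of the difference between a respondent parameter and an item difficulty parameter. The Python sketch below illustrates only this textbook form; it is an assumption that this form is sufficient for illustration, and the specific variant and fitting procedure used by Lewandrowski et al. are not reproduced here. The function and parameter names are hypothetical.

```python
import math


def rasch_probability(ability, difficulty):
    """Probability of a positive response under the dichotomous Rasch model.

    P = exp(ability - difficulty) / (1 + exp(ability - difficulty)),
    with both parameters expressed on the same logit scale.
    """
    z = ability - difficulty
    # Branch on the sign of z so that exp() never overflows.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)
```

When ability equals difficulty, the probability is exactly 0.5; as the item becomes more difficult relative to the respondent, the probability falls toward zero, which is how the model places respondents and items on a common scale.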
Despite the promising advancements of AI in spine surgery, a significant limitation in the current literature is the lack of external validation of many studies. Most models are only internally validated on the same data from which they were derived, raising concerns about model generalizability to larger patient populations or different clinical settings. It was estimated that only 5% of published articles on prognostic models included an external validation framework[29]. Without external validation, it is difficult to ensure that these AI models will perform reliably in diverse environments, further limiting their clinical application. This issue is compounded by the scarcity of randomized controlled trials (RCTs) investigating AI in spine surgery, which are essential for evaluating long-term effectiveness and accuracy.
Due to the lack of standardized reporting metrics for AI studies, it is imperative to create clear guidelines through which the risk of bias and the potential utility of these models can be evaluated. AI studies that focus primarily on diagnostic applications using medical imaging should adhere to the Checklist for Artificial Intelligence in Medical Imaging (CLAIM)[30]. The forthcoming Standards for Reporting of Diagnostic Accuracy Studies for AI (STARD-AI), an AI-specific adaptation of the established STARD guidelines, is also under development. Upon its release, it is expected to be indexed on the Enhancing the QUAlity and Transparency Of health Research (EQUATOR) Network, addressing similar methodological issues as those covered by CLAIM[31].
For ML multivariable prediction models, whether diagnostic or prognostic, the recently published Transparent Reporting of a Multivariable Prediction Model for Individual Prognosis or Diagnosis + Artificial Intelligence (TRIPOD + AI) provides a structured protocol for reporting predictive algorithms[32]. Despite the advancements since the initial 2015 TRIPOD statement, which has shown promise in improving methodological transparency[32,33], substantial gaps persist that hinder the broader integration of AI in clinical practice[34]. As AI prediction algorithms become more pervasive in spine surgery, internal and external validation frameworks are necessary to appraise model performance, so that models reflect the variability across patient populations and can enhance surgical precision.
CONCLUSION
The integration of AI and ML into spine surgery represents a transformative shift toward precision medicine, offering enhanced diagnostic and prognostic capabilities. With the advances in automated radiographic imaging, patient risk stratification, outcomes prediction, and personalized medicine, future work promises to tailor treatment to individual patients more accurately. Despite the promising achievements so far, the field must address challenges in data accuracy by expanding training datasets and implementing robust validation frameworks. As AI becomes more prevalent in spine surgery, successful integration has the power to refine surgical decision making and improve patient outcomes.
DECLARATIONS
Authors’ contributions
Original draft preparation, methodology, conceptualization: Turlip RW
Original draft presentation, conceptualization: Khela HS
Review and editing, supervision: Dagli MM, Ghenbot Y, Ahmad HS
Review and editing, validation: Chauhan D
Review and editing, supervision, conceptualization: Yoon JW
Critical writing: Turlip RW, Khela HS, Dagli MM, Chauhan D, Ghenbot Y, Ahmad HS, Yoon JW
Availability of data and materials
Not applicable.
Financial support and sponsorship
None.
Conflicts of interest
All authors declared that there are no conflicts of interest.
Ethical approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
Copyright
© The Author(s) 2024.